Table 1
Fragment of a labeled conversation (from the Switchboard corpus).

Speaker  Dialogue Act              Utterance
A        Yes-No-Question           So do you go to college right now?
A        Abandoned                 Are yo-,
B        Yes-Answer                Yeah,
B        Statement                 it's my last year [laughter].
A        Declarative-Question      You're a, so you're a senior now.
B        Yes-Answer                Yeah,
B        Statement                 I'm working on my projects trying to graduate [laughter].
A        Appreciation              Oh, good for you.
B        Backchannel               Yeah.
A        Appreciation              That's great,
A        Yes-No-Question           um, is, is N C University is that, uh, State,
B        Statement                 N C State.
A        Signal-Non-Understanding  What did you say?
B        Statement                 N C State.

1. Introduction 



The ability to model and automatically detect discourse structure is an important step
toward understanding spontaneous dialogue. While there is hardly consensus on exactly
how discourse structure should be described, some agreement exists that a useful
first level of analysis involves the identification of dialogue acts (DAs). A DA represents
the meaning of an utterance at the level of illocutionary force (Austin 1962). Thus, a DA
is approximately the equivalent of the speech act of Searle (1969), the conversational
game move of Power (1979), or the adjacency pair part of Schegloff (1968) and Sacks,
Schegloff, and Jefferson (1974).

Table 1 shows a sample of the kind of discourse structure in which we are interested.
Each utterance is assigned a unique DA label (shown in column 2), drawn from
a well-defined set (shown in Table 2). Thus, DAs can be thought of as a tag set that
classifies utterances according to a combination of pragmatic, semantic, and syntactic
criteria. The computational community has usually defined these DA categories so as
to be relevant to a particular application, although efforts are under way to develop DA
labeling systems that are domain-independent, such as the Discourse Resource
Initiative's DAMSL architecture (Core and Allen 1997).

While not constituting dialogue understanding in any deep sense, DA tagging seems
clearly useful to a range of applications. For example, a meeting summarizer needs
to keep track of who said what to whom, and a conversational agent needs to know
whether it was asked a question or ordered to do something. In related work DAs are
used as a first processing step to infer dialogue games (Carlson 1983; Levin and Moore
1977; Levin et al. 1999), a slightly higher level unit that comprises a small number of
DAs. Interactional dominance (Linell 1990) might be measured more accurately using
DA distributions than with simpler techniques, and could serve as an indicator of the
type or genre of discourse at hand. In all these cases, DA labels would enrich the
available input for higher-level processing of the spoken words. Another important role of
DA information could be feedback to lower-level processing. For example, a speech
recognizer could be constrained by expectations of likely DAs in a given context,
constraining the potential recognition hypotheses so as to improve accuracy.

Table 2
The 42 dialogue act labels. DA frequencies are given as percentages of the total number of
utterances in the overall corpus.

Tag                            Example                                           %
Statement                      Me, I'm in the legal department.                  36%
Backchannel/Acknowledge        Uh-huh.                                           19%
Opinion                        I think it's great                                13%
Abandoned/Uninterpretable      So, -                                             6%
Agreement/Accept               That's exactly it.                                5%
Appreciation                   I can imagine.                                    2%
Yes-No-Question                Do you have to have any special training?         2%
Non-verbal                     <Laughter>, <Throat_clearing>                     2%
Yes answers                    Yes.                                              1%
Conventional-closing           Well, it's been nice talking to you.              1%
Wh-Question                    What did you wear to work today?                  1%
No answers                     No.                                               1%
Response Acknowledgment        Oh, okay.                                         1%
Hedge                          I don't know if I'm making any sense or not.      1%
Declarative Yes-No-Question    So you can afford to get a house?                 1%
Other                          Well give me a break, you know.                   1%
Backchannel-Question           Is that right?                                    1%
Quotation                      You can't be pregnant and have cats               .5%
Summarize/reformulate          Oh, you mean you switched schools for the kids.   .5%
Affirmative non-yes answers    It is.                                            .4%
Action-directive               Why don't you go first                            .4%
Collaborative Completion       Who aren't contributing.                          .4%
Repeat-phrase                  Oh, fajitas                                       .3%
Open-Question                  How about you?                                    .3%
Rhetorical-Questions           Who would steal a newspaper?                      .2%
Hold before answer/agreement   I'm drawing a blank.                              .3%
Reject                         Well, no                                          .2%
Negative non-no answers        Uh, not a whole lot.                              .1%
Signal-non-understanding       Excuse me?                                        .1%
Other answers                  I don't know                                      .1%
Conventional-opening           How are you?                                      .1%
Or-Clause                      or is it more of a company?                       .1%
Dispreferred answers           Well, not so much that.                           .1%
3rd-party-talk                 My goodness, Diane, get down from there.          .1%
Offers, Options & Commits      I'll have to check that out                       .1%
Self-talk                      What's the word I'm looking for                   .1%
Downplayer                     That's all right.                                 .1%
Maybe/Accept-part              Something like that                               <.1%
Tag-Question                   Right?                                            <.1%
Declarative Wh-Question        You are what kind of buff?                        <.1%
Apology                        I'm sorry.                                        <.1%
Thanking                       Hey thanks a lot                                  <.1%

The goal of this article is twofold. First, we aim to present a comprehensive
framework for modeling and automatic classification of DAs, founded on
well-known statistical methods. In doing so, we will pull together previous approaches
as well as new ideas. For example, our model draws on the use of DA n-grams and
the hidden Markov models of conversation present in earlier work, such as Nagata
and Morimoto (1993, 1994) and Woszczyna and Waibel (1994) (see Section 7). However,
our framework generalizes earlier models, giving us a clean probabilistic approach for
performing DA classification from unreliable words and nonlexical evidence. For the
speech recognition task, our framework provides a mathematically principled way to
condition the speech recognizer on conversation context through dialogue structure, as
well as on nonlexical information correlated with DA identity. We will present methods
in a domain-independent framework that for the most part treats DA labels as an
arbitrary formal tag set. Throughout the presentation, we will highlight the simplifications
and assumptions made to achieve tractable models, and point out how they might fall
short of reality.

Second, we present results obtained with this approach on a large, widely available
corpus of spontaneous conversational speech. These results, besides validating the
methods described, are of interest for several reasons. For example, unlike in most
previous work on DA labeling, the corpus is not task-oriented in nature, and the amount
of data used (198,000 utterances) exceeds that in previous studies by at least an order of
magnitude (see Table 14).

To keep the presentation interesting and concrete, we will alternate between the
description of general methods and empirical results. Section 2 describes the task and
our data in detail. Section 3 presents the probabilistic modeling framework; a central
component of this framework, the discourse grammar, is further discussed in Section 4.
In Section 5 we describe experiments for DA classification. Section 6 shows how DA
models can be used to benefit speech recognition. Prior and related work is summarized
in Section 7. Further issues and open problems are addressed in Section 8, followed by
concluding remarks in Section 9.



2. The Dialogue Act Labeling Task 



The domain we chose to model is the Switchboard corpus of human-human conversational
telephone speech (Godfrey, Holliman, and McDaniel 1992) distributed by the Linguistic
Data Consortium. Each conversation involved two randomly selected strangers
who had been charged with talking informally about one of several, self-selected
general-interest topics. To train our statistical models on this corpus, we combined an
extensive effort in human hand-coding of DAs for each utterance, together with a variety of
automatic and semiautomatic tools. Our data consisted of a substantial portion of the
Switchboard waveforms and corresponding transcripts, totaling 1,155 conversations.



2.1 Utterance Segmentation 

Before hand-labeling each utterance in the corpus with a DA, we needed to choose an
utterance segmentation, as the raw Switchboard data is not segmented in a linguistically
consistent way. To expedite the DA labeling task and remain consistent with other
Switchboard-based research efforts, we made use of a version of the corpus that had
been hand-segmented into sentence-level units prior to our own work and independently
of our DA labeling system (Meteer et al. 1995). We refer to the units of this
segmentation as utterances. The relation between utterances and speaker turns is not
one-to-one: a single turn can contain multiple utterances, and utterances can span more
than one turn (e.g., in the case of backchanneling by the other speaker in mid-utterance).
Each utterance unit was identified with one DA, and was annotated with a single DA
label. The DA labeling system had special provisions for rare cases where utterances
seemed to combine aspects of several DA types.

Automatic segmentation of spontaneous speech is an open research problem in its
own right (Mast et al. 1996; Stolcke and Shriberg 1996). A rough idea of the difficulty
of the segmentation problem on this corpus and using the same definition of utterance
units can be derived from a recent study (Shriberg et al. 2000). In an automatic labeling
of word boundaries as either utterance boundaries or nonboundaries using a combination
of lexical and prosodic cues, we obtained 96% accuracy based on correct word transcripts,
and 78% accuracy with automatically recognized words. The fact that the segmentation and
labeling tasks are interdependent (Warnke et al. 1997; Finke et al. 1998) further
complicates the problem.

Based on these considerations, we decided not to confound the DA classification
task with the additional problems introduced by automatic segmentation and assumed
the utterance-level segmentations as given. An important consequence of this decision
is that we can expect utterance length and acoustic properties at utterance boundaries
to be accurate, both of which turn out to be important features of DAs (Shriberg et al.
1998, see also Section 5.2.1).



2.2 Tag Set 

We chose to follow a recent standard for shallow discourse structure annotation, the
Dialogue Act Markup in Several Layers (DAMSL) tag set, which was designed by the
natural language processing community under the auspices of the Discourse Resource
Initiative (Core and Allen 1997). We began with the DAMSL markup system, but modified
it in several ways to make it more relevant to our corpus and task. DAMSL aims to
provide a domain-independent framework for dialogue annotation, as reflected by the
fact that our tag set can be mapped back to DAMSL categories (Jurafsky, Shriberg, and
Biasca 1997). However, our labeling effort also showed that content- and task-related
distinctions will always play an important role in effective DA labeling.

The Switchboard domain itself is essentially "task-free," thus giving few external 
constraints on the definition of DA categories. Our primary purpose in adapting the tag 
set was to enable computational DA modeling for conversational speech, with possi- 
ble improvements to conversational speech recognition. Because of the lack of a specific 
task, we decided to label categories that seemed both inherently interesting linguis- 
tically and that could be identified reliably. Also, the focus on conversational speech 
recognition led to a certain bias toward categories that were lexically or syntactically 
distinct (recognition accuracy is traditionally measured including all lexical elements in 
an utterance). 

While the modeling techniques described in this paper are formally independent of 
the corpus and the choice of tag set, their success on any particular task will of course 
crucially depend on these factors. For different tasks not all the techniques used in this 
study might prove useful and others could be of greater importance. However, we be- 
lieve that this study represents a fairly comprehensive application of technology in this 
area and can serve as a point of departure and reference for other work. 

The resulting SWBD-DAMSL tag set was multidimensional; approximately 50 basic
tags (e.g., QUESTION, STATEMENT) could each be combined with diacritics indicating
orthogonal information, for example, about whether or not the dialogue function
of the utterance was related to Task-Management and Communication-Management.
Approximately 220 of the many possible unique combinations of these codes were used
by the coders (Jurafsky, Shriberg, and Biasca 1997). To obtain a system with somewhat
higher interlabeler agreement, as well as enough data per class for statistical modeling
purposes, a less fine-grained tag set was devised. This tag set distinguishes 42 mutually
exclusive utterance types and was used for the experiments reported here. Table 2
shows the 42 categories with examples and relative frequencies.¹ While some of the
original infrequent classes were collapsed, the resulting DA type distribution is still highly
skewed. This occurs largely because there was no basis for subdividing the dominant
DA categories according to task-independent and reliable criteria.

The tag set incorporates both traditional sociolinguistic and discourse-theoretic
notions, such as rhetorical relations and adjacency-pairs, as well as some more form-based
labels. Furthermore, the tag set is structured so as to allow labelers to annotate a
Switchboard conversation from transcripts alone (i.e., without listening) in about 30 minutes.
Without these constraints the DA labels might have included some finer distinctions,
but we felt that this drawback was balanced by the ability to cover a large amount of
data.²

1 For the study focusing on prosodic modeling of DAs reported elsewhere (Shriberg et al. 1998), the tag set
was further reduced to six categories.

Labeling was carried out in a three-month period in 1997 by eight linguistics graduate
students at CU Boulder. Interlabeler agreement for the 42-label tag set used here was
84%, resulting in a Kappa statistic of 0.80. The Kappa statistic measures agreement
normalized for chance (Siegel and Castellan, Jr. 1988). As argued in Carletta (1996), Kappa
values of 0.8 or higher are desirable for detecting associations between several coded
variables; we were thus satisfied with the level of agreement achieved. (Note that, even
though only a single variable, DA type, was coded for the present study, our goal is,
among other things, to model associations between several instances of that variable,
e.g., between adjacent DAs.)
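As a rough check on how the two reported figures relate, the standard definition of Kappa, (P_o - P_e) / (1 - P_e), can be solved for the chance agreement P_e. The sketch below does this in Python. The function names are illustrative, and the resulting chance agreement of about 0.20 is inferred here from the rounded values 84% and 0.80; it is not a figure reported in the text.

```python
def kappa(observed_agreement: float, chance_agreement: float) -> float:
    """Agreement above chance, normalized by the maximum possible improvement."""
    return (observed_agreement - chance_agreement) / (1.0 - chance_agreement)


def implied_chance_agreement(observed_agreement: float, kappa_value: float) -> float:
    """Solve kappa = (P_o - P_e) / (1 - P_e) for the chance agreement P_e."""
    return (observed_agreement - kappa_value) / (1.0 - kappa_value)


# Values reported in the text: 84% raw agreement, Kappa of 0.80.
p_e = implied_chance_agreement(0.84, 0.80)  # inferred, approximately 0.20
round_trip = kappa(0.84, p_e)               # recovers approximately 0.80
```

Because both inputs are rounded, the implied chance agreement should be read as an order-of-magnitude illustration only.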

A total of 1,155 Switchboard conversations were labeled, comprising 205,000 utter- 
ances and 1.4 million words. The data was partitioned into a training set of 1,115 con- 
versations (1.4M words, 198K utterances), used for estimating the various components 
of our model, and a test set of 19 conversations (29K words, 4K utterances). Remaining
conversations were set aside for future use (e.g., as a test set uncompromised by tuning
effects).

2.3 Major Dialogue Act Types 

The more frequent DA types are briefly characterized below. As discussed above, the
focus of this paper is not on the nature of DAs, but on the computational framework for
their recognition; full details of the DA tag set and numerous motivating examples can
be found in a separate report (Jurafsky, Shriberg, and Biasca 1997).



Statements and Opinions. The most common types of utterances were STATEMENTS and 
OPINIONS. This split distinguishes "descriptive, narrative, or personal" statements (STATE- 
MENT) from "other-directed opinion statements" (OPINION). The distinction was de- 
signed to capture the different kinds of responses we saw to opinions (which are often 
countered or disagreed with via further opinions) and to statements (which more often 
elicit continuers or backchannels): 

Dialogue Act  Example Utterance
Statement     Well, we have a cat, um,
Statement     He's probably, oh, a good two years old,
              big, old, fat and sassy tabby.
Statement     He's about five months old
Opinion       Well, rabbits are darling.
Opinion       I think it would be kind of stressful.

OPINIONS often include such hedges as I think, I believe, it seems, and I mean. We
combined the STATEMENT and OPINION classes for other studies on dimensions in which
they did not differ (Shriberg et al. 1998).



2 The effect of lacking acoustic information on labeling accuracy was assessed by relabeling a subset of the
data with listening, and was found to be fairly small (Shriberg et al. 1998). A conservative estimate based
on the relabeling study is that for most DA types at most 2% of the labels might have changed based on
listening. The only DA types with higher uncertainty were BACKCHANNELS and AGREEMENTS, which
are easily confused with each other without acoustic cues; here the rate of change was no more than 10%.
Table 3
Most common realizations of backchannels in Switchboard.

Frequency  Form     Frequency  Form     Frequency  Form
38%        uh-huh   2%         yes      1%         sure
34%        yeah     2%         okay     1%         um
9%         right    2%         oh yeah  1%         huh-uh
3%         oh       1%         huh      1%         uh



Questions. Questions were of several types. The YES-NO-QUESTION label includes only
utterances having both the pragmatic force of a yes-no-question and the syntactic
markings of a yes-no-question (i.e., subject-inversion or sentence-final tags).
DECLARATIVE-QUESTIONS are utterances that function pragmatically as questions but do not have
"question form." By this we mean that declarative questions normally have no wh-word
as the argument of the verb (except in "echo-question" format), and have "declarative"
word order in which the subject precedes the verb. See Weber (1993) for a survey of
declarative questions and their various realizations.



Dialogue Act          Example Utterance
Yes-No-Question       Do you have to have any special training?
Yes-No-Question       But that doesn't eliminate it, does it?
Yes-No-Question       Uh, I guess a year ago you're probably
                      watching C N N a lot, right?
Declarative-Question  So you're taking a government course?
Wh-Question           Well, how old are you?



Backchannels. A backchannel is a short utterance that plays discourse-structuring roles,
e.g., indicating that the speaker should go on talking. These are usually referred to in
the conversation analysis literature as "continuers" and have been studied extensively
(Jefferson 1984; Schegloff 1982; Yngve 1970). We expect recognition of backchannels to
be useful because of their discourse-structuring role (knowing that the hearer expects
the speaker to go on talking tells us something about the course of the narrative) and
because they seem to occur at certain kinds of syntactic boundaries; detecting a
backchannel may thus help in predicting utterance boundaries and surrounding lexical material.

For an intuition about what backchannels look like, Table 3 shows the most common
realizations of the approximately 300 types (35,827 tokens) of backchannel in our
Switchboard subset. The following table shows examples of backchannels in the context
of a Switchboard conversation:



Speaker  Dialogue Act  Utterance
B        Statement     but, uh, we're to the point now where our
                       financial income is enough that we can consider
                       putting some away -
A        Backchannel   Uh-huh. /
B        Statement     - for college, /
B        Statement     so we are going to be starting a regular payroll
                       deduction -
A        Backchannel   Um. /
B        Statement     - in the fall /
B        Statement     and then the money that I will be making this
                       summer we'll be putting away for the college
                       fund.
A        Appreciation  Um. Sounds good.
Turn Exits and Abandoned Utterances. Abandoned utterances are those that the speaker
breaks off without finishing, and are followed by a restart. Turn exits resemble
abandoned utterances in that they are often syntactically broken off, but they are used mainly
as a way of passing speakership to the other speaker. Turn exits tend to be single words,
often "so" or "or".

Speaker  Dialogue Act  Utterance
A        Statement     we're from, uh, I'm from Ohio /
A        Statement     and my wife's from Florida /
A        Turn-Exit     so, - /
B        Backchannel   Uh-huh. /
A        Hedge         so, I don't know, /
A        Abandoned     it's <lipsmack>, - /
A        Statement     I'm glad it's not the kind of problem I have to
                       come up with an answer to because it's not -



Answers and Agreements. YES-ANSWERS include yes, yeah, yep, uh-huh, and other
variations on yes, when they are acting as an answer to a YES-NO-QUESTION or
DECLARATIVE-QUESTION. Similarly, we also coded NO-ANSWERS. Detecting ANSWERS can help tell
us that the previous utterance was a YES-NO-QUESTION. Answers are also semantically
significant since they are likely to contain new information.

AGREEMENT/ACCEPT, REJECT, and MAYBE/ACCEPT-PART all mark the degree to
which a speaker accepts some previous proposal, plan, opinion, or statement. The most
common of these are the AGREEMENT/ACCEPTS. These are very often yes or yeah, so
they look a lot like ANSWERS. But where answers follow questions, agreements often
follow opinions or proposals, so distinguishing these can be important for the discourse.



3. Hidden Markov Modeling of Dialogue 



We will now describe the mathematical and computational framework used in our 
study. Our goal is to perform DA classification and other tasks using a probabilistic for- 
mulation, giving us a principled approach for combining multiple knowledge sources 
(using the laws of probability), as well as the ability to derive model parameters auto- 
matically from a corpus, using statistical inference techniques. 

Given all available evidence $E$ about a conversation, the goal is to find the DA
sequence $U$ that has the highest posterior probability $P(U \mid E)$ given that evidence.
Applying Bayes' Rule we get

$$
U^* = \operatorname*{argmax}_U P(U \mid E)
    = \operatorname*{argmax}_U \frac{P(U)\,P(E \mid U)}{P(E)}
    = \operatorname*{argmax}_U P(U)\,P(E \mid U) \tag{1}
$$

Here $P(U)$ represents the prior probability of a DA sequence, and $P(E \mid U)$ is the
likelihood of $U$ given the evidence. The likelihood is usually much more straightforward to
model than the posterior itself. This has to do with the fact that our models are
generative or causal in nature, i.e., they describe how the evidence is produced by the
underlying DA sequence $U$.
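The role of the normalizer $P(E)$ can be illustrated with a toy example. The sketch below uses a single label in place of a full DA sequence, and all probabilities are made-up numbers chosen for illustration; it only shows that ranking hypotheses by prior times likelihood gives the same winner as ranking by the normalized posterior.

```python
# Hypothetical priors P(U) and likelihoods P(E | U) for one fixed piece of
# evidence E; the numbers are invented for illustration.
prior = {"STATEMENT": 0.6, "BACKCHANNEL": 0.3, "YES-ANSWER": 0.1}
likelihood = {"STATEMENT": 0.01, "BACKCHANNEL": 0.20, "YES-ANSWER": 0.30}

# Unnormalized scores P(U) * P(E | U).
joint = {u: prior[u] * likelihood[u] for u in prior}

# P(E) is the same for every hypothesis, so dividing by it cannot change the ranking.
evidence = sum(joint.values())
posterior = {u: joint[u] / evidence for u in joint}

best_by_joint = max(joint, key=joint.get)
best_by_posterior = max(posterior, key=posterior.get)
```

Note that the label with the highest prior (STATEMENT) and the label with the highest likelihood (YES-ANSWER) both lose to BACKCHANNEL here, which has the best product of the two.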

Estimating $P(U)$ requires building a probabilistic discourse grammar, i.e., a
statistical model of DA sequences. This can be done using familiar techniques from language
modeling for speech recognition, although the sequenced objects in this case are DA
labels rather than words; discourse grammars will be discussed in detail in Section 4.

Table 4
Summary of random variables used in dialogue modeling. (Speaker labels are introduced in
Section 4.)

Symbol  Meaning
U       sequence of DA labels
E       evidence (complete speech signal)
F       prosodic evidence
A       acoustic evidence (spectral features used in ASR)
W       sequence of words
T       speaker labels

3.1 Dialogue Act Likelihoods 

The computation of likelihoods $P(E \mid U)$ depends on the types of evidence used. In our
experiments we used the following sources of evidence, either alone or in combination:

Transcribed words: The likelihoods used in Equation 1 are $P(W \mid U)$, where $W$ refers to
the true (hand-transcribed) words spoken in a conversation.

Recognized words: The evidence consists of recognizer acoustics $A$, and we seek to
compute $P(A \mid U)$. As described later, this involves considering multiple
alternative recognized word sequences.

Prosodic features: Evidence is given by the acoustic features $F$ capturing various
aspects of pitch, duration, energy, etc., of the speech signal; the associated
likelihoods are $P(F \mid U)$.

For ease of reference, all random variables used here are summarized in Table 4. The
same variables are used with subscripts to refer to individual utterances. For example,
$W_i$ is the word transcription of the $i$th utterance within a conversation (not the $i$th word).

To make both the modeling and the search for the best DA sequence feasible, we
further require that our likelihood models are decomposable by utterance. This means
that the likelihood given a complete conversation can be factored into likelihoods given
the individual utterances. We use $U_i$ for the $i$th DA label in the sequence $U$, i.e.,
$U = (U_1, \ldots, U_i, \ldots, U_n)$, where $n$ is the number of utterances in a conversation.
In addition, we use $E_i$ for that portion of the evidence that corresponds to the $i$th
utterance, e.g., the words or the prosody of the $i$th utterance. Decomposability of the
likelihood means that

$$
P(E \mid U) = P(E_1 \mid U_1) \cdot \ldots \cdot P(E_n \mid U_n) \tag{2}
$$

Applied separately to the three types of evidence $A_i$, $W_i$, and $F_i$ mentioned above, it
is clear that this assumption is not strictly true. For example, speakers tend to reuse
words found earlier in the conversation (Fowler and Housum 1987) and an answer
might actually be relevant to the question before it, violating the independence of the
$P(W_i \mid U_i)$. Similarly, speakers adjust their pitch or volume over time, e.g., to the
conversation partner or because of the structure of the discourse (Menn and Boyce 1982),
violating the independence of the $P(F_i \mid U_i)$. As in other areas of statistical modeling,
we count on the fact that these violations are small compared to the properties actually
modeled, namely, the dependence of $E_i$ on $U_i$.
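In practice a product of per-utterance likelihoods is usually accumulated as a sum of logarithms to avoid numerical underflow over long conversations. The following is a minimal sketch of that idea; the function name, the data layout, and the toy probabilities are assumptions made for illustration and are not taken from the article.

```python
import math


def sequence_log_likelihood(local_likelihoods, labels):
    """Return log P(E | U) as the sum of log P(E_i | U_i) over utterances.

    local_likelihoods[i][u] holds P(E_i | U_i = u) for utterance i;
    labels is the hypothesized DA sequence U.
    """
    return sum(math.log(local_likelihoods[i][u]) for i, u in enumerate(labels))


# Two utterances with invented local likelihoods.
local = [
    {"QUESTION": 0.20, "ANSWER": 0.01},
    {"QUESTION": 0.02, "ANSWER": 0.30},
]
log_lik = sequence_log_likelihood(local, ["QUESTION", "ANSWER"])
# math.exp(log_lik) equals 0.20 * 0.30 up to floating-point rounding.
```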

3.2 Markov Modeling 

             E_1               E_i               E_n
              ^                 ^                 ^
              |                 |                 |
<start> ---> U_1 ---> ... ---> U_i ---> ... ---> U_n ---> <end>

Figure 1
The discourse HMM as Bayes network.

Returning to the prior distribution of DA sequences $P(U)$, it is convenient to make
certain independence assumptions here, too. In particular, we assume that the prior
distribution of $U$ is Markovian, i.e., that each $U_i$ depends only on a fixed number $k$ of
preceding DA labels:

$$
P(U_i \mid U_1, \ldots, U_{i-1}) = P(U_i \mid U_{i-k}, \ldots, U_{i-1}) \tag{3}
$$

(k is the order of the Markov process describing U). The n-gram-based discourse gram- 
mars we used have this property. As described later, k = 1 is a very good choice, i.e., 
conditioning on the DA types more than one removed from the current one does not 
improve the quality of the model by much, at least with the amount of data available in 
our experiments. 
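For the case k = 1, the Markov assumption amounts to a bigram model over DA labels. The sketch below estimates such a model by plain relative frequencies on a toy data set. It is deliberately unsmoothed, whereas the discourse grammars used in the experiments are backoff n-gram models; the function names, the `<start>`/`<end>` token spellings, and the toy sequences are illustrative assumptions.

```python
from collections import Counter, defaultdict

START, END = "<start>", "<end>"


def train_bigram(sequences):
    """Unsmoothed maximum-likelihood estimates of P(U_i | U_{i-1})."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = [START, *seq, END]
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return {
        prev: {cur: c / sum(nexts.values()) for cur, c in nexts.items()}
        for prev, nexts in counts.items()
    }


def sequence_prob(model, seq):
    """P(U) under the bigram model, including start and end transitions."""
    padded = [START, *seq, END]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= model.get(prev, {}).get(cur, 0.0)  # unseen bigrams get probability 0
    return p


toy = [["QUESTION", "ANSWER"], ["QUESTION", "ANSWER", "STATEMENT"], ["STATEMENT"]]
bigram = train_bigram(toy)
p_qa = sequence_prob(bigram, ["QUESTION", "ANSWER"])  # (2/3) * 1 * (1/2)
```

The zero probability assigned to unseen bigrams is exactly the problem that smoothing and backoff are meant to address.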

The importance of the Markov assumption for the discourse grammar is that we
can now view the whole system of discourse grammar and local utterance-based
likelihoods as a $k$th-order hidden Markov model (HMM) (Rabiner and Juang 1986). The
HMM states correspond to DAs, observations correspond to utterances, transition
probabilities are given by the discourse grammar (see Section 4), and observation
probabilities are given by the local likelihoods $P(E_i \mid U_i)$.

We can represent the dependency structure (as well as the implied conditional
independences) as a special case of Bayesian belief network (Pearl 1988). Figure 1 shows
the variables in the resulting HMM with directed edges representing conditional
dependence. To keep things simple, a first-order HMM (bigram discourse grammar) is
assumed.

3.3 Dialogue Act Decoding 

The HMM representation allows us to use efficient dynamic programming algorithms 
to compute relevant aspects of the model, such as 

• the most probable DA sequence (the Viterbi algorithm) 

• the posterior probability of various DAs for a given utterance, after 
considering all the evidence (the forward-backward algorithm) 



The Viterbi algorithm for HMMs (Viterbi 1967) finds the globally most probable
state sequence. When applied to a discourse model with locally decomposable
likelihoods and Markovian discourse grammar, it will therefore find precisely the DA
sequence with the highest posterior probability:

$$
U^* = \operatorname*{argmax}_U P(U \mid E) \tag{4}
$$
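A compact way to see how this works is a log-domain Viterbi pass over a first-order model. The sketch below is a generic textbook formulation, not the implementation used in the experiments; it omits the end-of-conversation transition for brevity, and the two labels and all probabilities are invented.

```python
import math


def viterbi(labels, log_init, log_trans, log_obs):
    """Most probable label sequence for a first-order HMM.

    log_init[u]     : log P(U_1 = u | start)
    log_trans[v][u] : log P(U_i = u | U_{i-1} = v)
    log_obs[i][u]   : log P(E_i | U_i = u)
    """
    best = [{u: log_init[u] + log_obs[0][u] for u in labels}]
    back = []
    for i in range(1, len(log_obs)):
        scores, pointers = {}, {}
        for u in labels:
            # Best predecessor of label u at position i.
            prev = max(labels, key=lambda v: best[-1][v] + log_trans[v][u])
            scores[u] = best[-1][prev] + log_trans[prev][u] + log_obs[i][u]
            pointers[u] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda u: best[-1][u])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def logs(table):
    """Convert a dict of probabilities into log probabilities."""
    return {k: math.log(v) for k, v in table.items()}


labels = ["Q", "A"]  # toy stand-ins for question and answer DAs
log_init = logs({"Q": 0.6, "A": 0.4})
log_trans = {"Q": logs({"Q": 0.2, "A": 0.8}), "A": logs({"Q": 0.5, "A": 0.5})}
log_obs = [logs({"Q": 0.5, "A": 0.5}), logs({"Q": 0.4, "A": 0.6})]
best_path = viterbi(labels, log_init, log_trans, log_obs)
```

With these numbers the four candidate sequences have joint probabilities 0.024 (Q, Q), 0.144 (Q, A), 0.04 (A, Q), and 0.06 (A, A), so the decoder selects (Q, A).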

The combination of likelihood and prior modeling, HMMs, and Viterbi decoding is
fundamentally the same as the standard probabilistic approaches to speech recognition
(Bahl, Jelinek, and Mercer 1983) and tagging (Church 1988). It maximizes the
probability of getting the entire DA sequence correct, but it does not necessarily find the DA
sequence that has the most DA labels correct (Dermatas and Kokkinakis 1995). To
minimize the total number of utterance labeling errors, we need to maximize the probability
of getting each DA label correct individually, i.e., we need to maximize $P(U_i \mid E)$ for each
$i = 1, \ldots, n$. We can compute the per-utterance posterior DA probabilities by summing:

$$
P(U_i = u \mid E) = \sum_{U : U_i = u} P(U \mid E) \tag{5}
$$

where the summation is over all sequences $U$ whose $i$th element matches the label in
question. The summation is efficiently carried out by the forward-backward algorithm
for HMMs (Baum et al. 1970).³



For 0th-order (unigram) discourse grammars, Viterbi decoding and forward-backward
decoding necessarily yield the same results. However, for higher-order discourse gram- 
mars we found that forward-backward decoding consistently gives slightly (up to 1% 
absolute) better accuracies, as expected. Therefore, we used this method throughout. 
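The per-utterance posteriors can be sketched with a standard forward-backward pass over a first-order model. The code below is a generic textbook version under simplifying assumptions: it works with plain probabilities (long conversations would need scaling or log-domain arithmetic), it ignores the end-of-conversation transition, and the labels and numbers are invented for illustration.

```python
def forward_backward(labels, init, trans, obs):
    """Per-utterance posteriors P(U_i = u | E) for a first-order HMM.

    init[u]     : P(U_1 = u | start)
    trans[v][u] : P(U_i = u | U_{i-1} = v)
    obs[i][u]   : P(E_i | U_i = u)
    """
    n = len(obs)
    # Forward pass: fwd[i][u] = P(E_1..E_i, U_i = u).
    fwd = [{u: init[u] * obs[0][u] for u in labels}]
    for i in range(1, n):
        fwd.append({u: obs[i][u] * sum(fwd[-1][v] * trans[v][u] for v in labels)
                    for u in labels})
    # Backward pass: bwd[i][v] = P(E_{i+1}..E_n | U_i = v).
    bwd = [{u: 1.0 for u in labels}]
    for i in range(n - 1, 0, -1):
        bwd.insert(0, {v: sum(trans[v][u] * obs[i][u] * bwd[0][u] for u in labels)
                       for v in labels})
    total = sum(fwd[-1].values())  # P(E)
    return [{u: fwd[i][u] * bwd[i][u] / total for u in labels} for i in range(n)]


labels = ["Q", "A"]  # toy stand-ins for question and answer DAs
init = {"Q": 0.6, "A": 0.4}
trans = {"Q": {"Q": 0.2, "A": 0.8}, "A": {"Q": 0.5, "A": 0.5}}
obs = [{"Q": 0.5, "A": 0.5}, {"Q": 0.4, "A": 0.6}]

posteriors = forward_backward(labels, init, trans, obs)
# Posterior decoding picks the most probable label for each utterance separately.
decoded = [max(p, key=p.get) for p in posteriors]
```

In this small example posterior decoding and Viterbi decoding agree; they can differ when the probability mass is spread over several competing sequences.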

The formulation presented here, as well as all our experiments, uses the entire con- 
versation as evidence for DA classification. Obviously, this is possible only during offline 
processing, when the full conversation is available. Our paradigm thus follows histori- 
cal practice in the Switchboard domain, where the goal is typically the offline processing 
(e.g., automatic transcription, speaker identification, indexing, archival) of entire previ- 
ously recorded conversations. However, the HMM formulation used here also supports 
computing posterior DA probabilities based on partial evidence, e.g., using only the 
utterances preceding the current one, as would be required for online processing. 

4. Discourse Grammars 

The statistical discourse grammar models the prior probabilities P(U) of DA sequences. 
In the case of conversations for which the identities of the speakers are known (as 
in Switchboard), the discourse grammar should also model turn-taking behavior. A 
straightforward approach is to model sequences of pairs (Ui,Ti) where Ui is the DA 
label and Ti represents the speaker. We are not trying to model speaker idiosyncrasies, 
so conversants are arbitrarily identified as A or B, and the model is made symmetric 
with respect to the choice of sides (e.g., by replicating the training sequences with sides 
switched). Our discourse grammars thus had a vocabulary of 42 x 2 = 84 labels, plus 
tags for the beginning and end of conversations. For example, the second DA tag in
Table 1 would be predicted by a trigram discourse grammar using the fact that the same
speaker previously uttered a YES-NO-QUESTION, which in turn was preceded by the 
start-of-conversation. 
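
As a rough illustration of this encoding, the following sketch builds the joint DA/speaker token sequence for a conversation and adds the side-swapped copy that makes the model symmetric. The token format (`"DA/speaker"`), the boundary tags `START` and `END`, and all function names are assumptions made for the example only.

```python
def joint_tokens(das, speakers):
    """Pair each DA label with its speaker side, e.g. 'Yes-No-Question/A'."""
    return [f"{da}/{spk}" for da, spk in zip(das, speakers)]


def swap_sides(speakers):
    """Exchange the arbitrary side labels A and B."""
    flip = {"A": "B", "B": "A"}
    return [flip[s] for s in speakers]


def training_sequences(das, speakers):
    """Return the original sequence and its side-swapped replica,
    each wrapped in conversation boundary tags."""
    return [
        ["START"] + joint_tokens(das, speakers) + ["END"],
        ["START"] + joint_tokens(das, swap_sides(speakers)) + ["END"],
    ]
```

With 42 DA labels and two sides this yields the 84 joint labels mentioned above, plus the two boundary tags.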

4.1 N-gram Discourse Models 

A computationally convenient type of discourse grammar is an n-gram model based 
on DA tags, as it allows efficient decoding in the HMM framework. We trained standard
backoff n-gram models (Katz 1987), using the frequency smoothing approach of
Witten and Bell (1991). Models of various orders were compared by their perplexities,
i.e., the average number of choices the model predicts for each tag, conditioned on the 
preceding tags. 
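
For concreteness, perplexity can be computed as the exponential of the negative average log probability per tag. The sketch below assumes a hypothetical callable `prob(tag, history)` that returns the model probability of a tag given the preceding tags; it is not tied to any particular n-gram toolkit.

```python
import math


def perplexity(prob, sequences):
    """Perplexity of a tag model over a collection of tag sequences.

    prob(tag, history) must return P(tag | history), where history is the
    tuple of tags preceding `tag` in the same sequence.
    """
    log_sum, count = 0.0, 0
    for seq in sequences:
        for i, tag in enumerate(seq):
            log_sum += math.log(prob(tag, tuple(seq[:i])))
            count += 1
    return math.exp(-log_sum / count)
```

A model that assigns every one of the 42 DA labels the same probability has a perplexity of exactly 42, which is the value listed for the case without a discourse grammar in Table 5.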

Table 5 shows perplexities for three types of models: P(U), the DAs alone; P(U,T),
the combined DA/speaker ID sequence; and P(U|T), the DAs conditioned on known
speaker IDs (appropriate for the Switchboard task). As expected, we see an improvement
(decreasing perplexities) for increasing n-gram order. However, the incremental
gain of a trigram is small, and higher-order models did not prove useful. (This observation,
initially based on perplexity, is confirmed by the DA tagging experiments reported
in Section 5.) Comparing P(U) and P(U|T), we see that speaker identity adds substantial
information, especially for higher-order models.

3 We note in passing that the Viterbi and Baum algorithms have equivalent formulations in the Bayes
network framework (Pearl 1988). The HMM terminology was chosen here mainly for historical reasons.

Computational Linguistics Volume 26, Number 3

Table 5
Perplexities of DAs with and without turn information.

Discourse Grammar   P(U)   P(U,T)   P(U|T)
None                42     84       42
Unigram             11.0   18.5     9.0
Bigram              7.9    10.4     5.1
Trigram             7.5    9.8      4.8

The relatively small improvements from higher-order models could be a result of 
lack of training data, or of an inherent independence of DAs from DAs further removed. 
The near-optimality of the bigram discourse grammar is plausible given conversation 
analysis accounts of discourse structure in terms of adjacency pairs (Schegloff 1968;
Sacks, Schegloff, and Jefferson 1974). Inspection of bigram probabilities estimated from
our data revealed that conventional adjacency pairs receive high probabilities, as expected.
For example, 30% of YES-NO-QUESTIONS are followed by YES-ANSWERS, 14%
by NO-ANSWERS (confirming that the latter are dispreferred). COMMANDS are followed
by AGREEMENTS in 23% of the cases, and STATEMENTS elicit BACKCHANNELS in 26% 
of all cases. 



4.2 Other Discourse Models 

We also investigated non-n-gram discourse models, based on various language model- 
ing techniques known from speech recognition. One motivation for alternative models 
is that n-grams enforce a one-dimensional representation on DA sequences, whereas 
we saw above that the event space is really multidimensional (DA label and speaker 
labels). Another motivation is that n-grams fail to model long-distance dependencies, 
such as the fact that speakers may tend to repeat certain DAs or patterns throughout the 
conversation. 

The first alternative approach was a standard cache model (Kuhn and de Mori 1990),
which boosts the probabilities of previously observed unigrams and bigrams, on the 
theory that tokens tend to repeat themselves over longer distances. However, this does 
not seem to be true for DA sequences in our corpus, as the cache model showed no 
improvement over the standard n-gram. This result is somewhat surprising since unigram
dialogue grammars are able to detect speaker gender with 63% accuracy (over a
50% baseline) on Switchboard (Ries 1999b), indicating that there are global variables in
the DA distribution that could potentially be exploited by a cache dialogue grammar. 
Clearly, dialogue grammar adaptation needs further research. 

Second, we built a discourse grammar that incorporated constraints on DA se- 
quences in a nonhierarchical way, using maximum entropy (ME) estimation (Berger, 
Della Pietra, and Della Pietra 1996). The choice of features was informed by similar ones
commonly used in statistical language models, as well as our general intuitions about
potentially information-bearing elements in the discourse context. Thus, the model was
designed so that the current DA label was constrained by features such as unigram 
statistics, the previous DA and the DA once removed, DAs occurring within a window 
in the past, and whether the previous utterance was by the same speaker. We found, 
however, that an ME model using n-gram constraints performed only slightly better 
than a corresponding backoff n-gram. 

Additional constraints such as DA triggers, distance-1 bigrams, separate encoding 
of speaker change and bigrams to the last DA on the same/other channel did not improve
relative to the trigram model. The ME model thus confirms the adequacy of the
backoff n-gram approach, and leads us to conclude that DA sequences, at least in the 
Switchboard domain, are mostly characterized by local interactions, and thus modeled 
well by low-order n-gram statistics for this task. For more structured tasks this situation 
might be different. However, we have found no further exploitable structure. 

5. Dialogue Act Classification 

We now describe in more detail how the knowledge sources of words and prosody 
are modeled, and what automatic DA labeling results were obtained using each of the 
knowledge sources in turn. Finally, we present results for a combination of all knowl- 
edge sources. DA labeling accuracy results should be compared to a baseline (chance) 
accuracy of 35%, the relative frequency of the most frequent DA type (STATEMENT) in 



5.1 Dialogue Act Classification Using Words 

DA classification using words is based on the observation that different DAs use dis- 
tinctive word strings. It is known that certain cue words and phrases (Hirschberg and 
Litman 1993) can serve as explicit indicators of discourse structure. Similarly, we find 
distinctive correlations between certain phrases and DA types. For example, 92.4% of 
the uh-huh's occur in BACKCHANNELS, and 88.4% of the trigrams "<start> do you" oc- 
cur in YES-NO-QUESTIONS. To leverage this information source, without hand-coding 
knowledge about which words are indicative of which DAs, we will use statistical lan- 
guage models that model the full word sequences associated with each DA type. 
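
The idea of scoring an utterance with one language model per DA type can be sketched as follows. To stay short, the example uses add-one-smoothed unigram models, which is a deliberate simplification and an assumption of this sketch; the experiments reported below use backoff trigram models. All names are hypothetical.

```python
import math
from collections import Counter


def train_da_lms(labeled_utts):
    """Pool the words of all utterances of each DA type.

    labeled_utts: iterable of (da_label, list_of_words).
    Returns ({da: Counter of word counts}, vocabulary set).
    """
    counts, vocab = {}, set()
    for da, words in labeled_utts:
        counts.setdefault(da, Counter()).update(words)
        vocab.update(words)
    return counts, vocab


def log_likelihood(words, da_counts, vocab_size):
    """log P(W | U) under an add-one-smoothed unigram model for one DA.
    One extra vocabulary slot is reserved for unseen words."""
    total = sum(da_counts.values())
    return sum(math.log((da_counts[w] + 1) / (total + vocab_size + 1))
               for w in words)


def classify(words, counts, vocab):
    """Pick the DA whose language model gives the words the highest
    likelihood (no discourse grammar or prior is applied here)."""
    return max(counts, key=lambda da: log_likelihood(words, counts[da], len(vocab)))
```

In the full model these word likelihoods are not used in isolation but are combined with the discourse grammar in the HMM.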

5.1.1 Classification from True Words. Assuming that the true (hand-transcribed) words 
of utterances are given as evidence, we can compute word-based likelihoods P(W|U) in
a straightforward way, by building a statistical language model for each of the 42 DAs.
All DAs of a particular type found in the training corpus were pooled, and a DA-specific
trigram model was estimated using standard techniques (Katz backoff [Katz 1987] with
Witten-Bell discounting [Witten and Bell 1991]).



5.1.2 Classification from Recognized Words. For fully automatic DA classification, the 
above approach is only a partial solution, since we are not yet able to recognize words 
in spontaneous speech with perfect accuracy. A standard approach is to use the 1-best 
hypothesis from the speech recognizer in place of the true word transcripts. While con- 
ceptually simple and convenient, this method will not make optimal use of all the infor- 
mation in the recognizer, which in fact maintains multiple hypotheses as well as their 
relative plausibilities. 

A more thorough use of recognized speech can be derived as follows. The classifi- 
cation framework is modified such that the recognizer's acoustic information (spectral 
features) A appears as the evidence. We compute P(A|U) by decomposing it into an
acoustic likelihood P(A|W) and a word-based likelihood P(W|U), and summing over
all word sequences:

\[
\begin{aligned}
P(A \mid U) &= \sum_{W} P(A \mid W, U)\, P(W \mid U) \\
            &\approx \sum_{W} P(A \mid W)\, P(W \mid U)
\end{aligned}
\qquad (6)
\]

4 The frequency of STATEMENTS across all labeled data was slightly different, cf. Table 2.

Figure 2
Modified Bayes network including word hypotheses and recognizer acoustics.
[Diagram not reproduced.]



The second line is justified under the assumption that the recognizer acoustics (typically, 
cepstral coefficients) are invariant to DA type once the words are fixed. Note that this 
is another approximation in our modeling. For example, different DAs with common 
words may be realized by different word pronunciations. Figure 2 shows the Bayes
network resulting from modeling recognizer acoustics through word hypotheses under
this independence assumption; note the added Wi variables (that have to be summed
over) in comparison to Figure 1.

The acoustic likelihoods P(A|W) correspond to the acoustic scores the recognizer
outputs for every hypothesized word sequence W. The summation over all W must be
approximated; in our experiments we summed over the (up to) 2,500 best hypotheses
generated by the recognizer for each utterance. Care must be taken to scale the recognizer
acoustic scores properly, i.e., to exponentiate the recognizer acoustic scores by 1/λ,
where λ is the language model weight of the recognizer.⁵
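
A minimal sketch of this n-best approximation is given below. It assumes that each hypothesis comes with a log acoustic score and that a hypothetical callable `da_lm_logprob(words)` returns log P(W|U) for the DA under consideration; `lm_weight` and `ins_penalty` stand for the recognizer's language model weight and word insertion penalty. The summation is carried out in the log domain for numerical stability.

```python
import math


def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def nbest_log_likelihood(nbest, da_lm_logprob, lm_weight, ins_penalty):
    """Approximate log P(A | U) by summing over an n-best list.

    nbest: list of (acoustic_log_score, list_of_words) pairs.
    The acoustic score and the insertion penalty are both divided by the
    language model weight; the DA-specific LM score is left unscaled.
    """
    terms = [
        ac / lm_weight + da_lm_logprob(words) - (ins_penalty / lm_weight) * len(words)
        for ac, words in nbest
    ]
    return logsumexp(terms)
```

Evaluating this function once per DA type, each time with that DA's language model, gives the set of likelihoods needed by the HMM for one utterance.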

5.1.3 Results. Table 6 shows DA classification accuracies obtained by combining the
word- and recognizer-based likelihoods with the n-gram discourse grammars described
earlier. The best accuracy obtained from transcribed words, 71%, is encouraging given
a comparable human performance of 84% (the interlabeler agreement, see Section 2.2).
We observe about a 21% relative increase in classification error when using recognizer 
words; this is remarkably small considering that the speech recognizer used had a word 
error rate of 41% on the test set. 
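
As a worked check of the relative error figure, take the trigram discourse grammar results: an accuracy of 71.0% with true words corresponds to an error of 29.0%, and an accuracy of 64.8% with recognized words to an error of 35.2%. The relative increase in error is then

\[
\frac{(100 - 64.8) - (100 - 71.0)}{100 - 71.0} = \frac{35.2 - 29.0}{29.0} \approx 0.214 ,
\]

i.e., about 21%.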

We also compared the n-best DA classification approach to the more straightforward
1-best approach. In this experiment, only the single best recognizer hypothesis is
used, effectively treating it as the true word string. The 1-best method increased classification
error by about 7% relative to the n-best algorithm (61.5% accuracy with a bigram
discourse grammar).

5 In a standard recognizer the total log score of a hypothesis Wi is computed as

\[ \log P(A_i \mid W_i) + \lambda \log P(W_i) - \mu |W_i| , \]

where |Wi| is the number of words in the hypothesis, and both λ and μ are parameters optimized to
minimize the word error rate. The word insertion penalty μ represents a correction to the language model
that allows balancing insertion and deletion errors. The language model weight λ compensates for
acoustic score variances that are effectively too large due to severe independence assumptions in the
recognizer acoustic model. According to this rationale, it is more appropriate to divide all score
components by λ. Thus, in all our experiments, we computed a summand in Equation 6 whose logarithm
was

\[ \frac{1}{\lambda} \log P(A_i \mid W_i) + \log P(W_i \mid U_i) - \frac{\mu}{\lambda} |W_i| . \]

We found this approach to give better results than the standard multiplication of log P(W) by λ. Note
that for selecting the best hypothesis in a recognizer only the relative magnitudes of the score weights
matter; however, for the summation in Equation 6 the absolute values become important. The parameter
values for λ and μ were those used by the standard recognizer; they were not specifically optimized for
the DA classification task.

Table 6
DA classification accuracies (in %) from transcribed and recognized words (chance = 35%).

Discourse Grammar   True   Recognized   Relative Error Increase
None                54.3   42.8         25.2%
Unigram             68.2   61.8         20.1%
Bigram              70.6   64.3         21.4%
Trigram             71.0   64.8         21.4%

5.2 Dialogue Act Classification Using Prosody 

We also investigated prosodic information, i.e., information independent of the words 
as well as the standard recognizer acoustics. Prosody is important for DA recognition for 
two reasons. First, as we saw earlier, word-based classification suffers from recognition 
errors. Second, some utterances are inherently ambiguous based on words alone. For 
example, some YES-NO-QUESTIONS have word sequences identical to those of STATEMENTS,
but can often be distinguished by their final F0 rise.

A detailed study aimed at automatic prosodic classification of DAs in the Switchboard
domain is available in a companion paper (Shriberg et al. 1998). Here we investigate
the interaction of prosodic models with the dialogue grammar and the word-based
DA models discussed above. We also touch briefly on alternative machine learning
models for prosodic features.

5.2.1 Prosodic Features. Prosodic DA classification was based on a large set of features 
computed automatically from the waveform, without reference to word or phone in- 
formation. The features can be broadly grouped as referring to duration (e.g., utterance 
duration, with and without pauses), pauses (e.g., total and mean of nonspeech regions 
exceeding 100 ms), pitch (e.g., mean and range of F0 over utterance, slope of F0 regres- 
sion line), energy (e.g., mean and range of RMS energy, same for signal-to-noise ratio 
[SNR]), speaking rate (based on the "enrate" measure of Morgan, Fosler, and Mirghafori 
[1997]), and gender (of both speaker and listener). In the case of utterance duration, the 
measure correlates both with length in words and with overall speaking rate. The gen- 
der feature that classified speakers as either male or female was used to test for potential 
inadequacies in F0 normalizations. Where appropriate, we included both raw features 
and values normalized by utterance and/or conversation. We also included features
that are the output of the pitch accent and boundary tone event detector of Taylor (2000)
(e.g., the number of pitch accents in the utterance). A complete description of prosodic 
features and an analysis of their usage in our models can be found in Shriberg et al. 
(1998). 

5.2.2 Prosodic Decision Trees. For our prosodic classifiers, we used CART-style decision
trees (Breiman et al. 1984). Decision trees allow combination of discrete and continuous
features, and can be inspected to help in understanding the role of different
features and feature combinations. 

To illustrate one area in which prosody could aid our classification task, we applied 
trees to DA classifications known to be ambiguous from words alone. One frequent 
example in our corpus was the distinction between BACKCHANNELS and AGREEMENTS 
(see Table 2), which share terms such as right and yeah. As shown in Figure 3, a prosodic
tree trained on this task revealed that agreements have consistently longer durations 
and greater energy (as reflected by the SNR measure) than do backchannels. 

Figure 3
Decision tree for the classification of BACKCHANNELS (B) and AGREEMENTS (A). Each node is
labeled with the majority class for that node, as well as the posterior probabilities of the two
classes. The following features are queried in the tree: number of frames in continuous (> 1 s)
speech regions (cont_speech_frames), total utterance duration (ling_dur), utterance duration
excluding pauses > 100 ms (ling_dur_minus_min10pause), and mean signal-to-noise ratio
(snr_mean_utt).
[Tree diagram not reproduced.]

The HMM framework requires that we compute prosodic likelihoods of the form
P(Fi|Ui) for each utterance Ui and associated prosodic feature values Fi. We have the
apparent difficulty that decision trees (as well as other classifiers, such as neural networks)
give estimates for the posterior probabilities, P(Ui|Fi). The problem can be overcome
by applying Bayes' Rule locally:



\[
P(F_i \mid U_i) = P(F_i)\, \frac{P(U_i \mid F_i)}{P(U_i)} \propto \frac{P(U_i \mid F_i)}{P(U_i)}
\qquad (7)
\]



Note that P(Fi) does not depend on Ui and can be treated as a constant for the purpose 
of DA classification. A quantity proportional to the required likelihood can therefore be 
obtained either by dividing the posterior tree probability by the prior P(Ui),⁶ or by training
the tree on a uniform prior distribution of DA types. We chose the second approach,
downsampling our training data to equate DA proportions. This also counteracts a com- 
mon problem with tree classifiers trained on very skewed distributions of target classes, 
i.e., that low-frequency classes are not modeled in sufficient detail because the majority 
class dominates the tree-growing objective function. 
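
The first of the two options, dividing classifier posteriors by class priors, can be sketched in a few lines. The function name and the dictionary-based interface are assumptions for illustration; the numbers in the usage note are invented and serve only to show the effect.

```python
def scaled_likelihoods(posteriors, priors):
    """Turn classifier posteriors P(U | F) into quantities proportional to
    the likelihoods P(F | U) by dividing out the class priors P(U)."""
    return {da: posteriors[da] / priors[da] for da in posteriors}
```

For instance, with posteriors of 0.6 and 0.4 for two classes whose priors are 0.75 and 0.25, the scaled likelihoods are 0.8 and 1.6: the rarer class is favored by the prosodic evidence even though its posterior is lower. When the classifier is instead trained on uniform priors, as was done here, the division is by a constant and the posteriors can be used directly.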

5.2.3 Results with Decision Trees. As a preliminary experiment to test the integration
of prosody with other knowledge sources, we trained a single tree to discriminate
among the five most frequent DA types (STATEMENT, BACKCHANNEL, OPINION,
ABANDONED, and AGREEMENT, totaling 79% of the data) and an Other category comprising
all remaining DA types. The decision tree was trained on a downsampled training
subset containing equal proportions of these six DA classes. The tree achieved a
classification accuracy of 45.4% on an independent test set with the same uniform six-class
distribution. The chance accuracy on this set is 16.6%, so the tree clearly extracts
useful information from the prosodic features.

6 Bourlard and Morgan (1993) use this approach to integrate neural network phonetic models in a speech
recognizer.

Table 7
DA classification using prosodic decision trees (chance = 35%).

Discourse Grammar   Accuracy (%)
None                38.9
Unigram             48.3
Bigram              49.7

Table 8
Performance of various prosodic neural network classifiers on an equal-priors, six-class DA set
(chance = 16.6%).

Network Architecture                            Accuracy (%)
Decision tree                                   45.4
No hidden layer, linear output function         44.6
No hidden layer, softmax output function        46.0
40-unit hidden layer, softmax output function   46.0

We then used the decision tree posteriors as scaled DA likelihoods in the dialogue 
model HMM, combining it with various n-gram dialogue grammars for testing on our 
full standard test set. For the purpose of model integration, the likelihoods of the Other 
class were assigned to all DA types comprised by that class. As shown in Table 7, the
tree with dialogue grammar performs significantly better than chance on the raw DA
distribution, although not as well as the word-based methods (cf. Table 6).

5.2.4 Neural Network Classifiers. Although we chose to use decision trees as prosodic 
classifiers for their relative ease of inspection, we might have used any suitable proba- 
bilistic classifier, i.e., any model that estimates the posterior probabilities of DAs given 
the prosodic features. We conducted preliminary experiments to assess how neural net- 
works compare to decision trees for the type of data studied here. Neural networks are 
worth investigating since they offer potential advantages over decision trees. They can 
learn decision surfaces that lie at an angle to the axes of the input feature space, unlike 
standard CART trees, which always split continuous features on one dimension at a 
time. The response function of neural networks is continuous (smooth) at the decision 
boundaries, allowing them to avoid hard decisions and the complete fragmentation of 
data associated with decision tree questions. 



Most important, however, related work (Ries 1999a) indicated that similarly structured
networks are superior classifiers if the input features are words and are therefore
a plug-in replacement for the language model classifiers described in this paper. Neural 
networks are therefore a good candidate for a jointly optimized classifier of prosodic 
and word-level information since one can show that they are a generalization of the 
integration approach used here. 

We tested various neural network models on the same six-class downsampled data 
as used for decision tree training, using a variety of network architectures and output 
layer functions. The results are summarized in Table 8, along with the baseline result
obtained with the decision tree model. Based on these experiments, a softmax network
(Bridle 1990) without hidden units resulted in only a slight improvement over the decision
tree. A network with hidden units did not afford any additional advantage, even
after we optimized the number of hidden units, indicating that complex combinations
of features (as far as the network could learn them) do not predict DAs better than linear
combinations of input features. While we believe alternative classifier architectures
should be investigated further as prosodic models, the results so far seem to confirm
our choice of decision trees as a model class that gives close to optimal performance for
this task.

Figure 4
Bayes network for discourse HMM incorporating both word recognition and prosodic features.
[Diagram not reproduced.]

5.2.5 Intonation Event Likelihoods. An alternative way to compute prosodically based
DA likelihoods uses pitch accents and boundary phrases (Taylor et al. 1997). The approach
relies on the intuition that different utterance types are characterized by different
intonational "tunes" (Kowtko 1996), and has been successfully applied to the
classification of move types in the DCIEM Map Task corpus (Wright and Taylor 1997).
The system detects sequences of distinctive pitch patterns by training one continuous- 
density HMM for each DA type. Unfortunately, the event classification accuracy on the 
Switchboard corpus was considerably poorer than in the Map Task domain, and DA 
recognition results when coupled with a discourse grammar were substantially worse 
than with decision trees. The approach could prove valuable in the future, however, if 
the intonation event detector can be made more robust to corpora like ours. 

5.3 Using Multiple Knowledge Sources 

As mentioned earlier, we expect improved performance from combining word and 
prosodic information. Combining these knowledge sources requires estimating a combined
likelihood P(Ai, Fi|Ui) for each utterance. The simplest approach is to assume
that the two types of acoustic observations (recognizer acoustics and prosodic features)
are approximately conditionally independent once Ui is given:

\[
\begin{aligned}
P(A_i, W_i, F_i \mid U_i) &= P(A_i, W_i \mid U_i)\, P(F_i \mid A_i, W_i, U_i) \\
                          &\approx P(A_i, W_i \mid U_i)\, P(F_i \mid U_i)
\end{aligned}
\qquad (8)
\]

Since the recognizer acoustics are modeled by way of their dependence on words, it is 
particularly important to avoid using prosodic features that are directly correlated with 
word identities, or features that are also modeled by the discourse grammars, such as 
utterance position relative to turn changes. Figure 4 depicts the Bayes network incorporating
evidence from both word recognition and prosodic features.

One important respect in which the independence assumption is violated is in the 
modeling of utterance length. While utterance length itself is not a prosodic feature, 
it is an important feature to condition on when examining prosodic characteristics of 
utterances, and is thus best included in the decision tree. Utterance length is captured 
directly by the tree using various duration measures, while the DA-specific LMs encode 
the average number of words per utterance indirectly through n-gram parameters, but
still accurately enough to violate independence in a significant way (Finke et al. 1998).
As discussed in Section 8, this problem is best addressed by joint lexical-prosodic models.

Table 9
Combined utterance classification accuracies (chance = 35%). The first two columns correspond
to Tables 7 and 6, respectively.

Discourse Grammar   Accuracy (%)
                    Prosody   Recognizer   Combined
None                38.9      42.8         56.5
Unigram             48.3      61.8         62.4
Bigram              49.7      64.3         65.0



We need to allow for the fact that the models combined in Equation 8 give estimates
of differing qualities. Therefore, we introduce an exponential weight α on P(Fi|Ui) that
controls the contribution of the prosodic likelihood to the overall likelihood. Finally, a
second exponential weight β on the combined likelihood controls its dynamic range
relative to the discourse grammar scores, partially compensating for any correlation
between the two likelihoods. The revised combined likelihood estimate thus becomes

\[
P(A_i, W_i, F_i \mid U_i) \approx \left\{ P(A_i, W_i \mid U_i)\, P(F_i \mid U_i)^{\alpha} \right\}^{\beta}
\qquad (9)
\]

In our experiments, the parameters α and β were optimized using twofold jackknifing.
The test data was split roughly in half (without speaker overlap), each half was used to 
separately optimize the parameters, and the best values were then tested on the respec- 
tive other half. The reported results are from the aggregate outcome on the two test set 
halves. 
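
The weighted combination and the twofold jackknifing procedure can be sketched as follows. The code works in the log domain, where the exponential weights become multiplicative factors. The function names, the grid of candidate weight pairs, and the `accuracy(data, params)` callable are all assumptions introduced for this illustration.

```python
def combined_log_likelihood(log_p_words, log_p_prosody, alpha, beta):
    """Log of the weighted combination: beta * (log P(A,W|U) + alpha * log P(F|U))."""
    return beta * (log_p_words + alpha * log_p_prosody)


def twofold_jackknife(half_a, half_b, grid, accuracy):
    """Tune (alpha, beta) on each half and evaluate on the other half.

    grid: candidate (alpha, beta) pairs.
    accuracy(data, params): fraction of utterances in `data` labeled correctly.
    Returns the accuracy aggregated over both halves, weighted by size.
    """
    best_on_a = max(grid, key=lambda p: accuracy(half_a, p))
    best_on_b = max(grid, key=lambda p: accuracy(half_b, p))
    n_a, n_b = len(half_a), len(half_b)
    correct = accuracy(half_b, best_on_a) * n_b + accuracy(half_a, best_on_b) * n_a
    return correct / (n_a + n_b)
```

Because the weights tested on one half are always chosen on the other half, no utterance contributes to tuning the parameters that are used to score it.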



5.3.1 Results. In this experiment we combined the acoustic n-best likelihoods based on 
recognized words with the Top-5 tree classifier mentioned in Section 5.2.3. Results are 
summarized in Table 9.

As shown, the combined classifier presents a slight improvement over the recognizer- 
based classifier. The experiment without discourse grammar indicates that the com- 
bined evidence is considerably stronger than either knowledge source alone, yet this 
improvement seems to be made largely redundant by the use of priors and the discourse 
grammar. For example, by definition DECLARATIVE-QUESTIONS are not marked by syn- 
tax (e.g., by subject-auxiliary inversion) and are thus confusable with STATEMENTS and 
OPINIONS. While prosody is expected to help disambiguate these cases, the ambiguity 
can also be removed by examining the context of the utterance, e.g., by noticing that the 
following utterance is a YES-ANSWER or NO-ANSWER.



5.3.2 Focused Classifications. To gain a better understanding of the potential for prosodic 
DA classification independent of the effects of discourse grammar and the skewed DA 
distribution in Switchboard, we examined several binary DA classification tasks. The 
choice of tasks was motivated by an analysis of confusions committed by a purely 
word-based DA detector, which tends to mistake QUESTIONS for STATEMENTS, and 
BACKCHANNELS for AGREEMENTS (and vice versa). We tested a prosodic classifier, a 
word-based classifier (with both transcribed and recognized words), and a combined 
classifier on these two tasks, downsampling the DA distribution to equate the class 
sizes in each case. Chance performance in all experiments is therefore 50%. Results are 
summarized in Table 10.

As shown, the combined classifier was consistently more accurate than the classifier
using words alone. Although the gain in accuracy was not statistically significant for
the small recognizer test set because of a lack of power, replication for a larger hand-transcribed
test set showed the gain to be highly significant for both subtasks by a Sign
test, p < .001 and p < .0001 (one-tailed), respectively. Across these, as well as additional
subtasks, the relative advantage of adding prosody was larger for recognized than for
true words, suggesting that prosody is particularly helpful when word information is
not perfect.

Table 10
Accuracy (in %) for individual and combined models for two subtasks, using uniform priors
(chance = 50%).

Classification Task
  Knowledge Source          True Words   Recognized Words
Questions/Statements
  prosody only              76.0         76.0
  words only                85.9         75.4
  words+prosody             87.6         79.8
Agreements/Backchannels
  prosody only              72.9         72.9
  words only                81.0         78.2
  words+prosody             84.7         81.7

6. Speech Recognition 

We now consider ways to use DA modeling to enhance automatic speech recognition 
(ASR). The intuition behind this approach is that discourse context constrains the choice 
of DAs for a given utterance, and the DA type in turn constrains the choice of words. 
The latter can then be leveraged for more accurate speech recognition. 

6.1 Integrating DA Modeling and ASR 

Constraints on the word sequences hypothesized by a recognizer are expressed prob- 
abilistically in the recognizer language model (LM). It provides the prior distribution 
P(Wi) for finding the a posteriori most probable hypothesized words for an utterance,
given the acoustic evidence Ai (Bahl, Jelinek, and Mercer 1983):⁷



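\[
\begin{aligned}
W_i^{*} &= \arg\max_{W_i} P(W_i \mid A_i) \\
        &= \arg\max_{W_i} \frac{P(W_i)\, P(A_i \mid W_i)}{P(A_i)} \\
        &= \arg\max_{W_i} P(W_i)\, P(A_i \mid W_i)
\end{aligned}
\qquad (10)
\]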

The likelihoods P(Ai\Wi) are estimated by the recognizer's acoustic model. In a stan- 
dard recognizer the language model P(Wi) is the same for all utterances; the idea here 
is to obtain better-quality LMs by conditioning on the DA type since presumably the 
word distributions differ depending on DA type. 

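\[
\begin{aligned}
W_i^{*} &= \arg\max_{W_i} P(W_i \mid A_i, U_i) \\
        &= \arg\max_{W_i} \frac{P(W_i \mid U_i)\, P(A_i \mid W_i, U_i)}{P(A_i \mid U_i)} \\
        &\approx \arg\max_{W_i} P(W_i \mid U_i)\, P(A_i \mid W_i)
\end{aligned}
\qquad (11)
\]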

7 Note the similarity of Equations 10 and 1. They are identical except for the fact that we are now operating
at the level of an individual utterance, the evidence is given by the acoustics, and the targets are word
hypotheses instead of DA hypotheses.

As before in the DA classification model, we tacitly assume that the words Wi depend
only on the DA of the current utterance, and also that the acoustics are independent of
the DA type if the words are fixed. The DA-conditioned language models P(Wi|Ui) are
readily trained from DA-specific training data, much like we did for DA classification
from words.⁸

The problem with applying Equation 11, of course, is that the DA type Ui is generally
not known (except maybe in applications where the user interface can be engineered
to allow only one kind of DA for a given utterance). Therefore, we need to infer
the likely DA types for each utterance, using available evidence E from the entire con- 
versation. This leads to the following formulation: 

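\[
\begin{aligned}
W_i^{*} &= \arg\max_{W_i} P(W_i \mid A_i, E) \\
        &= \arg\max_{W_i} \sum_{U_i} P(W_i \mid A_i, U_i, E)\, P(U_i \mid E) \\
        &\approx \arg\max_{W_i} \sum_{U_i} P(W_i \mid A_i, U_i)\, P(U_i \mid E)
\end{aligned}
\qquad (12)
\]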

The last step in Equation 12 is justified because, as shown in Figures 1 and 4, the evidence
E (acoustics, prosody, words) pertaining to utterances other than i can affect the
current utterance only through its DA type Ui.

We call this the mixture-of-posteriors approach, because it amounts to a mixture of 
the posterior distributions obtained from DA-specific speech recognizers (Equation 11),
using the DA posteriors as weights. This approach is quite expensive, however, as it 
requires multiple full recognizer or rescoring passes of the input, one for each DA type. 

A more efficient, though mathematically less accurate, solution can be obtained by 
combining guesses about the correct DA types directly at the level of the LM. We esti- 
mate the distribution of likely DA types for a given utterance using the entire conversa- 
tion E as evidence, and then use a sentence-level mixture (Iyer, Ostendorf, and Rohlicek 
1994) of DA-specific LMs in a single recognizer run. In other words, we replace P(Wi|Ui)
in Equation 11 with

\[ \sum_{U_i} P(W_i \mid U_i)\, P(U_i \mid E) , \]

a weighted mixture of all DA-specific LMs. We call this the mixture-of-LMs approach. 
In practice, we would first estimate DA posteriors for each utterance, using the forward- 
backward algorithm and the models described in Section [|, and then rerecognize the 
conversation or rescore the recognizer output, using the new posterior-weighted mixture
LM. Fortunately, as shown in the next section, the mixture-of-LMs approach seems to give
results that are almost identical to (and as good as) those of the mixture-of-posteriors
approach.
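As a rough illustration of the mixture-of-LMs idea, the following Python sketch scores a word string with a sentence-level mixture, i.e., the sum over DA types of P(U_i | E) times the probability of the whole word string under that DA's LM. The toy unigram tables, the DA names, and the posterior values are invented for illustration only; the DA-specific LMs in the experiments are n-gram models, and the posteriors would come from the forward-backward pass described above.

```python
import math


def sentence_mixture_logprob(words, da_lms, da_posteriors):
    """Return log sum_U P(U|E) * P(W|U) for one utterance.

    da_lms[da][word] is a toy unigram probability standing in for a
    DA-specific LM; da_posteriors[da] is P(U = da | E).
    """
    log_terms = []
    for da, weight in da_posteriors.items():
        if weight <= 0.0:
            continue  # a DA with zero posterior contributes nothing
        log_p_words = sum(math.log(da_lms[da][w]) for w in words)
        log_terms.append(math.log(weight) + log_p_words)
    # Log-sum-exp keeps the sum numerically stable for long utterances.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


# Hypothetical DA-specific unigram LMs and DA posteriors (not from the paper).
DA_LMS = {
    "statement": {"yeah": 0.1, "i": 0.5, "see": 0.4},
    "backchannel": {"yeah": 0.8, "i": 0.1, "see": 0.1},
}
DA_POSTERIORS = {"statement": 0.3, "backchannel": 0.7}

if __name__ == "__main__":
    for utterance in (["yeah"], ["i", "see"]):
        logp = sentence_mixture_logprob(utterance, DA_LMS, DA_POSTERIORS)
        print(utterance, round(math.exp(logp), 4))
```

Note that the DA weights are applied once per utterance rather than once per word, which is what distinguishes a sentence-level mixture from ordinary word-level LM interpolation.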

6.2 Computational Structure of Mixture Modeling 

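It is instructive to compare the expanded scoring formulae for the two DA mixture
modeling approaches for ASR. The mixture-of-posteriors approach yields

\[ P(W_i \mid A_i, E) \approx \sum_{U_i} \frac{P(W_i \mid U_i) \, P(A_i \mid W_i)}{P(A_i \mid U_i)} \, P(U_i \mid E) , \qquad (13) \]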



8 In Equation 11 and elsewhere in this section we gloss over the issue of proper weighting of model
probabilities, which is extremely important in practice. The approach explained in detail in footnote 
applies here as well. 
Table 11

Switchboard word recognition error rates and LM perplexities.

Model                   WER (%)   Perplexity
Baseline                41.2      76.8
1-best LM               41.0      69.3
Mixture-of-posteriors   41.0      n/a
Mixture-of-LMs          40.9      66.9
Oracle LM               40.3      66.8


whereas the mixture-of-LMs approach gives 

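\[ P(W_i \mid A_i, E) \approx \left( \sum_{U_i} P(W_i \mid U_i) \, P(U_i \mid E) \right) \frac{P(A_i \mid W_i)}{P(A_i)} . \qquad (14) \]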

We see that the two equations coincide under the crude approximation
P(A_i | U_i) ≈ P(A_i). In practice, the denominators are computed by summing the
numerators over a finite number of word hypotheses W_i, so this difference translates into
normalizing either before summing over DAs (mixture-of-posteriors) or after it
(mixture-of-LMs). When the normalization takes place as the final step it can be omitted
for score maximization purposes; this shows why the mixture-of-LMs approach is less
computationally expensive.
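The difference between normalizing before and after the sum over DAs can be made concrete on a toy n-best list. The sketch below is only an illustration under stated assumptions: the three hypotheses, their acoustic likelihoods P(A_i | W_i), the DA-specific LM probabilities, and the DA posteriors are all made-up numbers, and the function names are hypothetical.

```python
def mixture_of_posteriors(acoustic, lm, da_post):
    """Normalize within each DA over the n-best list, then sum over DAs."""
    n = len(acoustic)
    scores = [0.0] * n
    for da, weight in da_post.items():
        joint = [lm[da][k] * acoustic[k] for k in range(n)]
        norm = sum(joint)  # n-best estimate of the DA-specific denominator
        for k in range(n):
            scores[k] += weight * joint[k] / norm
    return scores


def mixture_of_lms(acoustic, lm, da_post):
    """Sum over DAs first; the final normalization is skipped for argmax."""
    n = len(acoustic)
    return [
        acoustic[k] * sum(da_post[da] * lm[da][k] for da in da_post)
        for k in range(n)
    ]


def best(scores):
    """Index of the highest-scoring hypothesis."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical 3-best list: acoustic likelihoods and per-DA LM probabilities.
ACOUSTIC = [0.5, 0.3, 0.2]
LM = {"statement": [0.2, 0.5, 0.3], "backchannel": [0.6, 0.1, 0.3]}
DA_POST = {"statement": 0.4, "backchannel": 0.6}

if __name__ == "__main__":
    print(mixture_of_posteriors(ACOUSTIC, LM, DA_POST))
    print(mixture_of_lms(ACOUSTIC, LM, DA_POST))
```

The first function needs one normalization per DA type, mirroring the one-recognizer-pass-per-DA cost of the mixture-of-posteriors approach, whereas the second needs a single pass with the mixed LM.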

6.3 Experiments and Results 

We tested both the mixture-of-posteriors and the mixture-of-LMs approaches on our 
Switchboard test set of 19 conversations. Instead of decoding the data from scratch us- 
ing the modified models, we manipulated n-best lists consisting of up to 2,500 best 
hypotheses for each utterance. Rescoring n-best lists is also convenient since both approaches
require access to the full word string for hypothesis scoring; the overall model is no
longer Markovian, and is therefore inconvenient to use in the first decoding stage, or
even in lattice rescoring.

The baseline for our experiments was obtained with a standard backoff trigram lan- 
guage model estimated from all available training data. The DA-specific language mod- 
els were trained on word transcripts of all the training utterances of a given type, and 
then smoothed further by interpolating them with the baseline LM. Each DA-specific 
LM used its own interpolation weight, obtained by minimizing the perplexity of the in- 
terpolated model on held-out DA-specific training data. Note that this smoothing step 
is helpful when using the DA-specific LMs for word recognition, but not for DA classification,
since it renders the DA-specific LMs less discriminative.⁹
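The text states only that each interpolation weight minimizes held-out perplexity; it does not say how the minimization was carried out. One common way to do this is the EM update for a two-component mixture weight, sketched below as an assumption rather than as the procedure actually used. The per-word probabilities are invented toy values standing in for what the DA-specific and baseline LMs would assign to held-out words.

```python
import math


def estimate_lambda(p_da, p_base, iters=200, lam=0.5):
    """EM estimate of the weight on the DA-specific LM.

    p_da[k] and p_base[k] are the probabilities the two LMs assign to the
    k-th held-out word. Each EM step cannot increase held-out perplexity.
    """
    for _ in range(iters):
        # E-step: posterior that each word was generated by the DA-specific LM.
        post = [lam * a / (lam * a + (1.0 - lam) * b) for a, b in zip(p_da, p_base)]
        # M-step: the new weight is the average posterior.
        lam = sum(post) / len(post)
    return lam


def perplexity(p_da, p_base, lam):
    """Perplexity of the interpolated model on the held-out words."""
    log_lik = sum(math.log(lam * a + (1.0 - lam) * b) for a, b in zip(p_da, p_base))
    return math.exp(-log_lik / len(p_da))


# Hypothetical held-out word probabilities under each LM.
P_DA = [0.2, 0.05, 0.1, 0.01]
P_BASE = [0.1, 0.1, 0.05, 0.05]

if __name__ == "__main__":
    lam = estimate_lambda(P_DA, P_BASE)
    print(round(lam, 3), round(perplexity(P_DA, P_BASE, lam), 3))
```

Running the estimator separately on the held-out data of each DA type would give each DA-specific LM its own weight, as described above.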

Table 11 summarizes both the word error rates achieved with the various models
and the perplexities of the corresponding LMs used in the rescoring (note that perplexity
is not meaningful in the mixture-of-posteriors approach). For comparison, we also
included two additional models. First, the "1-best LM" refers to always using the DA-specific
LM corresponding to the most probable DA type for each utterance; it is thus an
approximation to both mixture approaches in which only the top DA is considered. Second, we
included an "oracle LM," i.e., always using the LM that corresponds to the hand-labeled
DA for each utterance. The purpose of this experiment was to give us an upper bound
on the effectiveness of the mixture approaches, by assuming perfect DA recognition.

9 Indeed, during our DA classification experiments, we had observed that smoothed DA-specific LMs yield
lower classification accuracy.

Table 12

Word error reductions through DA oracle, by DA type.

Dialogue Act                 Baseline WER (%)   Oracle WER (%)   WER Reduction
No-Answer                    29.4               11.8             -17.6
Backchannel                  25.9               18.6             -7.3
Backchannel-Question         15.2               9.1              -6.1
Abandoned/Uninterpretable    48.9               45.2             -3.7
Wh-Question                  38.4               34.9             -3.5
Yes-No-Question              55.5               52.3             -3.2
Statement                    42.0               41.5             -0.5
Opinion                      40.8               40.4             -0.4

Figure 5

Relative contributions to test set word counts by DA type (chart labels recoverable from
the source: Statement 53%, Other 8%).

It was somewhat disappointing that the word error rate (WER) improvement in
the oracle experiment was small (2.2% relative), even though statistically highly
significant (p < .0001, one-tailed, according to a Sign test on matched utterance pairs). The
WER reduction achieved with the mixture-of-LMs approach did not reach statistical
significance (0.25 > p > 0.20). The 1-best LM and the two mixture models also did
not differ significantly on this test set. In interpreting these results one must realize,
however, that WER results depend on a complex combination of factors, most notably
the interaction between the language models and the acoustic models. Since the experiments
only varied the language models used in rescoring, it is also informative to compare the
quality of these models as reflected by perplexity. On this measure, we see a substantial
13% (relative) reduction, which is achieved by both the oracle and the mixture-of-LMs.
The perplexity reduction for the 1-best LM is only 9.8%, showing the advantage of the
mixture approach.
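The relative reductions quoted in this paragraph follow directly from the Table 11 entries, and the significance statement rests on a one-tailed Sign test. The short Python sketch below recomputes the percentages and shows how such a test could be evaluated. The win/loss counts passed to the test are purely hypothetical, since the per-utterance outcomes are not reported here.

```python
from math import comb


def relative_reduction(baseline, new):
    """Relative reduction in percent, e.g., of WER or perplexity."""
    return 100.0 * (baseline - new) / baseline


def sign_test_one_tailed(wins, losses):
    """P(at least `wins` successes in wins+losses fair coin flips).

    Ties (utterance pairs with identical error counts) are assumed to have
    been discarded beforehand.
    """
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


if __name__ == "__main__":
    # Values taken from Table 11.
    print(round(relative_reduction(41.2, 40.3), 1))  # oracle LM, WER
    print(round(relative_reduction(76.8, 66.9), 1))  # mixture-of-LMs, perplexity
    print(round(relative_reduction(76.8, 69.3), 1))  # 1-best LM, perplexity
    # Hypothetical counts: 8 utterances improved, 2 got worse.
    print(sign_test_one_tailed(8, 2))
```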

To better understand the lack of a more substantial reduction in word error, we
analyzed the effect of the DA-conditioned rescoring on the individual DAs, i.e., grouping
the test utterances by their true DA types. Table 12 shows the WER improvements
for a few DA types, ordered by the magnitude of improvement achieved. As shown,
all frequent DA types saw improvement, but the largest gains were observed for typically
short DAs, such as ANSWERS and BACKCHANNELS. This is to be expected, as such
DAs tend to be syntactically and lexically highly constrained. Furthermore, the distribution
of the number of words across DA types is very uneven (Figure 5). STATEMENTS
and OPINIONS, the DA types dominating in both frequency and number of words (83%
of total), see no more than 0.5% absolute improvement, thus explaining the small overall
improvement. In hindsight, this is also not surprising, since the bulk of the training
data for the baseline LM consists of these DAs, allowing only little improvement in
the DA-specific LMs. A more detailed analysis of the effect of DA modeling on speech
recognition errors can be found elsewhere (Van Ess-Dykema and Ries 1998).
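A back-of-the-envelope check makes the argument above more tangible. The sketch assumes that the overall WER change is approximately the word-share-weighted average of the per-DA changes. The STATEMENT share (53%) comes from Figure 5; the OPINION share (30%) is inferred here as the remainder of the 83% quoted in the text, and is an assumption of this example.

```python
# Word shares: Statement from Figure 5; Opinion inferred as 83% - 53%.
SHARES = {"Statement": 0.53, "Opinion": 0.30}
# Absolute WER reductions (percentage points) from Table 12.
REDUCTIONS = {"Statement": 0.5, "Opinion": 0.4}
# Overall oracle improvement from Table 11 (baseline minus oracle WER).
OVERALL = 41.2 - 40.3

# Contribution of the two dominant DA types to the overall improvement.
dominant = sum(SHARES[da] * REDUCTIONS[da] for da in SHARES)
# Share of words belonging to all remaining DA types.
rest_share = 1.0 - sum(SHARES.values())
# Average absolute reduction the remaining DA types must have achieved.
implied_rest = (OVERALL - dominant) / rest_share

if __name__ == "__main__":
    print(round(dominant, 3), round(rest_share, 2), round(implied_rest, 2))
```

Under these assumptions the two dominant DA types account for less than half of the overall gain, and the remaining 17% of the words must have improved by roughly three points on average, which is in line with the larger per-type reductions listed in Table 12.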
In summary, our experiments confirmed that DA modeling can improve word recognition
accuracy quite substantially in principle, at least for certain DA types, but that the
skewed distribution of DAs (especially in terms of number of words per type) limits the 
usefulness of the approach on the Switchboard corpus. The benefits of DA modeling 
might therefore be more pronounced on corpora with more even DA distribution, as is 
typically the case for task-oriented dialogues. Task-oriented dialogues might also fea- 
ture specific subtypes of general DA categories that might be constrained by discourse. 
Prior research on task-oriented dialogues summarized in the next section, however, has 
also found only small reductions in WER (on the order of 1%). This suggests that even in 
task-oriented domains more research is needed to realize the potential of DA modeling 
for ASR. 

7. Prior and Related Work 

As indicated in the introduction, our work builds on a number of previous efforts in 
computational discourse modeling and automatic discourse processing, most of which 
occurred over the last half-decade. It is generally not possible to directly compare quantitative
results because of vast differences in methodology, tag set, type and amount of
training data, and, principally, assumptions made about what information is available 
for "free" (e.g., hand-transcribed versus automatically recognized words, or segmented 
versus unsegmented utterances). Thus, we will focus on the conceptual aspects of previ- 
ous research efforts, and while we do offer a summary of previous quantitative results, 
these should be interpreted as informative datapoints only, and not as fair comparisons 
between algorithms. 

Previous research on DA modeling has generally focused on task-oriented dialogue,
with three tasks in particular garnering much of the research effort. The Map Task
corpus (Anderson et al. 1991; Bard et al. 1995) consists of conversations between two
speakers with slightly different maps of an imaginary territory. Their task is to help one
speaker reproduce a route drawn only on the other speaker's map, all without being
able to see each other's maps. Of the DA modeling algorithms described below, Taylor
et al. (1998) and Wright (1998) were based on Map Task. The VERBMOBIL corpus
consists of two-party scheduling dialogues. A number of the DA modeling algorithms
described below were developed for VERBMOBIL, including those of Mast et al. (1996),
Warnke et al. (1997), Reithinger et al. (1996), Reithinger and Klesen (1997), and Samuel,
Carberry, and Vijay-Shanker (1998). The ATR Conference corpus is a subset of a larger
ATR Dialogue database consisting of simulated dialogues between a secretary and a
questioner at international conferences. Researchers using this corpus include Nagata
(1992), Nagata and Morimoto (1993, 1994), and Kita et al. (1996). Table 13 shows the
most commonly used versions of the tag sets from those three tasks.

As discussed earlier, these domains differ from the Switchboard corpus in being
task-oriented. Their tag sets are also generally smaller, but some of the same problems
of balance occur. For example, in the Map Task domain, 33% of the words occur in 1
of the 12 DAs (INSTRUCT). Table 14 shows the approximate size of the corpora, the tag
set, and tag estimation accuracy rates for various recent models of DA prediction. The
results summarized in the table also illustrate the differences in inherent difficulty of the
tasks. For example, the task of Warnke et al. (1997) was to simultaneously segment and
tag DAs, whereas the other results rely on a prior manual segmentation. Similarly, the
task in Wright (1998) and in our study was to determine DA types from speech input,
whereas work by others is based on hand-transcribed textual input.

The use of n-grams to model the probabilities of DA sequences, or to predict upcoming
DAs online, has been proposed by many authors. It seems to have been first
employed by Nagata (1992), and in follow-up papers by Nagata and Morimoto (1993,
1994) on the ATR Dialogue database. The model predicted upcoming DAs by using bigrams
and trigrams conditioned on preceding DAs, trained on a corpus of 2,722 DAs.
Many others subsequently relied on and enhanced this n-grams-of-DAs approach, often
by applying standard techniques from statistical language modeling. Reithinger et
al. (1996), for example, used deleted interpolation to smooth the dialogue n-grams.
Chu-Carroll (1998) uses knowledge of subdialogue structure to selectively skip previous DAs
in choosing conditioning for DA prediction.
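To make the n-grams-of-DAs idea concrete, here is a minimal Python sketch of a DA bigram model smoothed by linear interpolation with the unigram distribution. It is not a reimplementation of any of the cited systems: the training sequence is simply the DA column of Table 1 read top to bottom (ignoring speaker identity), and the interpolation weight is fixed by hand, whereas deleted interpolation would estimate it on held-out data.

```python
from collections import Counter

# DA labels of the Table 1 fragment, in order (speaker identity ignored).
TABLE1_DAS = [
    "Yes-No-Question", "Abandoned", "Yes-Answer", "Statement",
    "Declarative-Question", "Yes-Answer", "Statement", "Appreciation",
    "Backchannel", "Appreciation", "Yes-No-Question", "Statement",
    "Signal-Non-Understanding", "Statement",
]


def train_da_bigram(sequence):
    """Collect unigram, history, and bigram counts from one DA sequence."""
    unigrams = Counter(sequence)
    histories = Counter(sequence[:-1])  # DAs that have a successor
    bigrams = Counter(zip(sequence, sequence[1:]))
    return unigrams, histories, bigrams, len(sequence)


def da_prob(next_da, prev_da, model, lam=0.7):
    """Interpolated estimate of P(next_da | prev_da); lam is hand-set."""
    unigrams, histories, bigrams, total = model
    p_uni = unigrams[next_da] / total
    if histories[prev_da] == 0:
        return p_uni  # unseen history: fall back to the unigram estimate
    p_bi = bigrams[(prev_da, next_da)] / histories[prev_da]
    return lam * p_bi + (1.0 - lam) * p_uni


if __name__ == "__main__":
    model = train_da_bigram(TABLE1_DAS)
    print(round(da_prob("Statement", "Yes-Answer", model), 4))
    print(round(da_prob("Backchannel", "Yes-Answer", model), 4))
```

Even on this tiny sample the model captures that a Yes-Answer is typically followed by a Statement, while interpolation keeps unseen successors from receiving zero probability.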

Nagata and Morimoto (1993, 1994) may also have been the first to use word n-grams
as a miniature grammar for DAs, to be used in improving speech recognition.
The idea caught on very quickly: Suhm and Waibel (1994), Mast et al. (1996), Warnke
et al. (1997), Reithinger and Klesen (1997), and Taylor et al. (1998) all use variants of
backoff, interpolated, or class n-gram language models to estimate DA likelihoods. Any
kind of sufficiently powerful, trainable language model could perform this function, of
course, and indeed Alexandersson and Reithinger (1997) propose using automatically
learned stochastic context-free grammars. Jurafsky, Shriberg, Fox, and Curl (1998) show
that the grammar of some DAs, such as appreciations, can be captured by finite-state
automata over part-of-speech tags.

N-gram models are likelihood models for DAs, i.e., they compute the conditional
probabilities of the word sequence given the DA type. Word-based posterior probability
estimators are also possible, although less common. Mast et al. (1996) propose the use
of semantic classification trees, a kind of decision tree conditioned on word patterns as
features. Finally, Ries (1999a) shows that neural networks using only unigram features
can be superior to higher-order n-gram DA models. Warnke et al. (1999) and Ohler,
Harbeck, and Niemann (1999) use related discriminative training algorithms for language
models.

Woszczyna and Waibel (1994) and Suhm and Waibel (1994), followed by Chu-Carroll
(1998), seem to have been the first to note that such a combination of word and dialogue
n-grams could be viewed as a dialogue HMM with word strings as the observations.
(Indeed, with the exception of Samuel, Carberry, and Vijay-Shanker (1998), all models
listed in Table 14 rely on some version of this HMM metaphor.) Some researchers
explicitly used HMM induction techniques to infer dialogue grammars. Woszczyna and
Waibel (1994), for example, trained an ergodic HMM using expectation-maximization to
model speech act sequencing. Kita et al. (1996) made one of the few attempts at unsupervised
discovery of dialogue structure, where a finite-state grammar induction algorithm
is used to find the topology of the dialogue grammar.
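The HMM view can be sketched in a few lines: DA types are the hidden states, a DA bigram supplies the transition probabilities, and each utterance contributes a likelihood of its observed evidence under every DA type. The Python sketch below computes per-utterance DA posteriors with the scaled forward-backward algorithm. All numbers are invented for a two-state toy problem, and the function is a generic textbook formulation rather than the code of any system discussed here.

```python
def forward_backward(init, trans, lik):
    """Posterior P(state at t | all observations) for a discrete HMM.

    init[s]     : prior probability of DA type s for the first utterance
    trans[r][s] : DA bigram probability of s following r
    lik[t][s]   : likelihood of utterance t's evidence given DA type s
    """
    n_steps, n_states = len(lik), len(init)
    alpha, scale = [], []
    for t in range(n_steps):
        if t == 0:
            row = [init[s] * lik[0][s] for s in range(n_states)]
        else:
            row = [
                lik[t][s] * sum(alpha[t - 1][r] * trans[r][s] for r in range(n_states))
                for s in range(n_states)
            ]
        norm = sum(row)  # rescaling avoids underflow on long conversations
        scale.append(norm)
        alpha.append([v / norm for v in row])
    beta = [[1.0] * n_states for _ in range(n_steps)]
    for t in range(n_steps - 2, -1, -1):
        beta[t] = [
            sum(trans[s][r] * lik[t + 1][r] * beta[t + 1][r] for r in range(n_states))
            / scale[t + 1]
            for s in range(n_states)
        ]
    return [[alpha[t][s] * beta[t][s] for s in range(n_states)] for t in range(n_steps)]


# Hypothetical two-DA example with three utterances.
INIT = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
LIK = [[0.5, 0.1], [0.2, 0.3], [0.4, 0.4]]

if __name__ == "__main__":
    for t, row in enumerate(forward_backward(INIT, TRANS, LIK)):
        print(t, [round(p, 4) for p in row])
```

Because the backward pass is included, the posterior for each utterance reflects evidence from both earlier and later utterances in the conversation.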

Computational approaches to prosodic modeling of DAs have aimed to automatically
extract various prosodic parameters (such as duration, pitch, and energy patterns)
from the speech signal (Yoshimura et al. [1996]; Taylor et al. [1997]; Kompe [1997],
among others). Some approaches model F0 patterns with techniques such as vector
quantization and Gaussian classifiers to help disambiguate utterance types. An extensive
comparison of the prosodic DA modeling literature with our work can be found in
Shriberg et al. (1998).

DA modeling has mostly been geared toward automatic DA classification, and much
less work has been done on applying DA models to automatic speech recognition.
Nagata and Morimoto (1994) suggest conditioning word language models on DAs to lower
perplexity. Suhm and Waibel (1994) and Eckert, Gallwitz, and Niemann (1996) each
condition a recognizer LM on left-to-right DA predictions and are able to show reductions
in word error rate of 1% on task-oriented corpora. Most similar to our own work, but
still in a task-oriented domain, the work by Taylor et al. (1998) combines DA likelihoods
from prosodic models with those from 1-best recognition output to condition the
recognizer LM, again achieving an absolute reduction in word error rate of 1%, which is as
disappointing as the 0.3% improvement in our experiments.

Related computational tasks beyond DA classification and speech recognition have
received even less attention to date. We already mentioned Warnke et al. (1997) and
Finke et al. (1998), who both showed that utterance segmentation and classification can
be integrated into a single search process. Fukada et al. (1998) investigate augmenting
DA tagging with more detailed semantic "concept" tags, as a preliminary step toward
an interlingua-based dialogue translation system. Levin et al. (1999) couple DA
classification with dialogue game classification; dialogue games are units above the DA level,
i.e., short DA sequences such as question-answer pairs.

All the work mentioned so far uses statistical models of various kinds. As we have
shown here, such models offer some fundamental advantages, such as modularity and
composability (e.g., of discourse grammars with DA models) and the ability to deal
with noisy input (e.g., from a speech recognizer) in a principled way. However, many
other classifier architectures are applicable to the tasks discussed, in particular to DA
classification. A nonprobabilistic approach for DA labeling proposed by Samuel,
Carberry, and Vijay-Shanker (1998) is transformation-based learning (Brill 1993). Finally, it
should be noted that there are other tasks with a mathematical structure similar to that
of DA tagging, such as shallow parsing for natural language processing (Munk 1999)
and DNA classification tasks (Ohler, Harbeck, and Niemann 1999), from which further
techniques could be borrowed.

How does the approach presented here differ from these various earlier models, 
particularly those based on HMMs? Apart from corpus and tag set differences, our ap- 
proach differs primarily in that it generalizes the simple HMM approach to cope with 
new kinds of problems, based on the Bayes network representations depicted in Fig- 
ures U and ||. For the DA classification task, our framework allows us to do classification 
given unreliable words (by marginalizing over the possible word strings correspond- 
ing to the acoustic input) and given nonlexical (e.g., prosodic) evidence. For the speech 
recognition task, the generalized model gives a clean probabilistic framework for condi- 
tioning word probabilities on the conversation context via the underlying DA structure. 
Unlike previous models that did not address speech recognition or relied only on an 
intuitive 1-best approximation, our model allows computation of the optimum word 
sequence by effectively summing over all possible DA sequences as well as all recogni- 
tion hypotheses throughout the conversation, using evidence from both past and future. 



8. Discussion and Issues for Future Research 



Our approach to dialogue modeling has two major components: statistical dialogue 
grammars modeling the sequencing of DAs, and DA likelihood models expressing the 
local cues (both lexical and prosodic) for DAs. We made a number of significant sim- 
plifications to arrive at a computationally and statistically tractable formulation. In this 
formulation, DAs serve as the hinges that join the various model components, but also 
decouple these components through statistical independence assumptions. Conditional 
on the DAs, the observations across utterances are assumed to be independent, and ev- 
idence of different kinds from the same utterance (e.g., lexical and prosodic) is assumed 
to be independent. Finally, DA types themselves are assumed independent beyond a 
short span (corresponding to the order of the dialogue n-gram). Further research within 
this framework can be characterized by which of these simplifications are addressed. 
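The three independence assumptions listed above can be written out compactly. The following is a sketch in notation chosen for this illustration: U is the DA sequence for a conversation of n utterances, E_i is all evidence pertaining to utterance i, k is the order of the dialogue n-gram, and L_i and F_i (names assumed here, not taken from the text) stand for the lexical and prosodic evidence of utterance i.

```latex
% DA sequence prior: dependence limited to the previous k-1 DAs
P(U) \approx \prod_{i=1}^{n} P(U_i \mid U_{i-k+1}, \ldots, U_{i-1})

% Observations across utterances independent given the DAs
P(E \mid U) \approx \prod_{i=1}^{n} P(E_i \mid U_i)

% Different kinds of evidence within an utterance independent given its DA
P(E_i \mid U_i) \approx P(L_i \mid U_i) \, P(F_i \mid U_i)
```

Relaxing any one of these three approximations corresponds to one of the research directions discussed in the remainder of this section.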

Dialogue grammars for conversational speech need to be made more aware of the 
temporal properties of utterances. For example, we are currently not modeling the fact 
that utterances by the conversants may actually overlap (e.g., backchannels interrupting
an ongoing utterance). In addition, we should model more of the nonlocal aspects of
discourse structure, despite our negative results so far. For example, a context-free
discourse grammar could potentially account for the nested structures proposed in Grosz
and Sidner (1986).¹⁰

The standard n-gram models for DA discrimination with lexical cues are probably
suboptimal for this task, simply because they are trained in the maximum likelihood
framework, without explicitly optimizing discrimination between DA types. This may
be overcome by using discriminative training procedures (Warnke et al. 1999; Ohler,
Harbeck, and Niemann 1999). Training neural networks directly with posterior
probability (Ries 1999a) seems to be a more principled approach and it also offers much
easier integration with other knowledge sources. Prosodic features, for example, can
simply be added to the lexical features, allowing the model to capture dependencies
and redundancies across knowledge sources. Keyword-based techniques from the field
of message classification should also be applicable here (Rose, Chang, and Lippmann
1991). Eventually, it is desirable to integrate dialogue grammar, lexical and prosodic
cues into a single model, e.g., one that predicts the next DA based on DA history and all
the local evidence.

The study of automatically extracted prosodic features for DA modeling is likewise 
only in its infancy. Our preliminary experiments with neural networks have shown that 
small gains are obtainable with improved statistical modeling techniques. However, we 
believe that more progress can be made by improving the underlying features them- 
selves, in terms of both better understanding of how speakers use them, and ways to 
reliably extract them from data. 

Regarding the data itself, we saw that the distribution of DAs in our corpus limits 
the benefit of DA modeling for lower-level processing, in particular speech recognition. 
The reason for the skewed distribution was in the nature of the task (or lack thereof) 
in Switchboard. It remains to be seen if more fine-grained DA distinctions can be made 
reliably in this corpus. However, it should be noted that the DA definitions are really 
arbitrary as far as tasks other than DA labeling are concerned. This suggests using un- 
supervised, self-organizing learning schemes that choose their own DA definitions in 
the process of optimizing the primary task, whatever it may be. Hand-labeled DA cate- 
gories may still serve an important role in initializing such an algorithm. 

We believe that dialogue-related tasks have much to benefit from corpus-driven, au- 
tomatic learning techniques. To enable such research, we need fairly large, standardized 
corpora that allow comparisons over time and across approaches. Despite its shortcom- 
ings, the Switchboard domain could serve this purpose. 

9. Conclusions 

We have developed an integrated probabilistic approach to dialogue act modeling for 
conversational speech, and tested it on a large speech corpus. The approach combines 
models for lexical and prosodic realizations of DAs, as well as a statistical discourse 
grammar. All components of the model are automatically trained, and are thus appli- 
cable to other domains for which labeled data is available. Classification accuracies 
achieved so far are highly encouraging, relative to the inherent difficulty of the task 
as measured by human labeler performance. We investigated several modeling alterna- 
tives for the components of the model (backoff n-grams and maximum entropy models 



10 The inadequacy of n-gram models for nested discourse structures is pointed out by Chu-Carroll (1998),
although the suggested solution is a modified n-gram approach. 
Table 13

Dialogue act tag sets used in three other extensively studied corpora.

VERBMOBIL. These 18 high-level DAs used in VERBMOBIL-1 are abstracted over a total of 43
more specific DAs; most experiments on VERBMOBIL DAs use the set of 18 rather than 43.
Examples are from Jekat et al. (1995).

Tag               Example
Thank             Thanks
Greet             Hello Dan
Introduce         It's me again
Bye               Alright bye
Request-Comment   How does that look?
Suggest           from thirteenth through seventeenth June
Reject            No Friday I'm booked all day
Accept            Saturday sounds fine,
Request-Suggest   What is a good day of the week for you?
Init              I wanted to make an appointment with you
Give_Reason       Because I have meetings all afternoon
Feedback          Okay
Deliberate        Let me check my calendar here
Confirm           Okay, that would be wonderful
Clarify           Okay, do you mean Tuesday the 23rd?
Digress           [we could meet for lunch] and eat lots of ice cream
Motivate          We should go to visit our subsidiary in Munich
Garbage           Oops, I-

Map Task. The 12 DAs or "move types" used in Map Task. Examples are from Taylor et al. (1998).

Tag               Example
Instruct          Go round, ehm horizontally underneath diamond mine
Explain           I don't have a ravine
Align             Okay?
Check             So going down to Indian Country?
Query-yn          Have you got the graveyard written down?
Query-w           In where?
Acknowledge       Okay
Clarify           {you want to go... diagonally} Diagonally down
Reply-y           I do.
Reply-n           No, I don't
Reply-w           {And across to?} The pyramid.
Ready             Okay

ATR. The 9 DAs ("illocutionary force types") used in the ATR Dialogue Database task; some later
models used an extended set of 15 DAs. Examples are from the English translations given
by Nagata (1992).

Tag               Example
Phatic            Hello
Expressive        Thank you
Response          That's right
Promise           I will send you a registration form
Request           Please go to Kitaooji station by subway
Inform            We are not giving any discount this time
Questionif        Do you have the announcement of the conference?
Questionref       What should I do?
Questionconf      You have already transferred the registration fee, right?

Table 14

Data on recent DA tagging experiments. The number of DA tokens reflects training set size;
accuracy refers to automatic tagging correctness. The error rates should not be compared, since
the tasks were quite different. The comment field indicates special difficulties due to the type of
input data.

Source                                       DA Tokens    DA Types / Tag Set       Accuracy
Woszczyna and Waibel (1994)                  150-250(?)   6                        74.1%
Nagata and Morimoto (1994)                   2,450        15 / ATR                 39.7%
Reithinger et al. (1996)                     6,494        18 / VERBMOBIL           ≈ 40%
Mast et al. (1996)                           6,494        18 / VERBMOBIL           59.7%
Warnke et al. (1997)                         6,494        18 / VERBMOBIL           53.4%
Reithinger and Klesen (1997)                 2,701        18 / VERBMOBIL           74.7%
Chu-Carroll (1998)                           915          15                       49.71%
Wright (1998)                                3,276        12 / Map Task            64%
Taylor et al. (1998)                         9,272        12 / Map Task            47%
Samuel, Carberry, and Vijay-Shanker (1998)   2,701        18 / VERBMOBIL           75.12%
Fukada et al. (1998)                         3,584        26 / C-Star (Japanese)   81.2%
                                             1,902        26 / C-Star (English)    56.9%
Present study                                198,000      42 / SWBD-DAMSL          65%

Comments: two rows carry the comment "from speech".

for discourse grammars, decision trees and neural networks for prosodic classification)
and found performance largely independent of these choices. Finally, we developed a 
principled way of incorporating DA modeling into the probability model of a contin- 
uous speech recognizer, by constraining word hypotheses using the discourse context. 
However, the approach gives only a small reduction in word error on our corpus, which 
can be attributed to a preponderance of a single dialogue act type (statements). 



Note

The research described here is based on a project at the 1997 Workshop on Innovative Techniques
in LVCSR at the Center for Speech and Language Processing at Johns Hopkins University
(Jurafsky et al. 1997; Jurafsky et al. 1998). The DA-labeled Switchboard transcripts as well
as other project-related publications are available at http://www.colorado.edu/ling/jurafsky/

Garrod, Stephen D. Isard, Jacqueline C.
Kowtko, Jan M. McAllister, Jim Miller,
Catherine F. Sotillo, Henry S. Thompson,
and Regina Weinert. 1991. The HCRC
Map Task corpus. Language and Speech,
34(4):351-366.

[Austin 1962] Austin, J. L. 1962. How to do
Things with Words. Clarendon Press,
Oxford.

[Bahl, Jelinek, and Mercer 1983] Bahl, Lalit R.,
Frederick Jelinek, and Robert L. Mercer.
1983. A maximum likelihood approach to
continuous speech recognition. IEEE
Transactions on Pattern Analysis and Machine
Intelligence, 5(2):179-190, March.

[Bard et al. 1995] Bard, Ellen G., Catherine
Sotillo, Anne H. Anderson, and M. M.
Taylor. 1995. The DCIEM Map Task
corpus: Spontaneous dialogues under
sleep deprivation and drug treatment. In
Isabel Trancoso and Roger Moore, editors,
Proceedings of the ESCA-NATO Tutorial and
Workshop on Speech under Stress, pages
25-28, Lisbon, September.

[Baum et al. 1970] Baum, Leonard E., Ted
Petrie, George Soules, and Norman Weiss.
1970. A maximization technique occurring
in the statistical analysis of probabilistic
functions in Markov chains. The Annals of
Mathematical Statistics, 41(1):164-171.

[Berger, Della Pietra, and Della Pietra 1996]
Berger, Adam L., Stephen A. Della Pietra,
and Vincent J. Della Pietra. 1996. A
maximum entropy approach to natural
language processing. Computational
Linguistics, 22(1):39-71.

[Bourlard and Morgan 1993] Bourlard, Hervé
and Nelson Morgan. 1993. Connectionist
Speech Recognition. A Hybrid Approach.
Kluwer Academic Publishers, Boston, MA.

[Breiman et al. 1984] Breiman, L., J. H.
Friedman, R. A. Olshen, and C. J. Stone.
1984. Classification and Regression Trees.
Wadsworth and Brooks, Pacific Grove, CA.

[Bridle 1990] Bridle, J. S. 1990. Probabilistic
interpretation of feedforward classification
network outputs, with relationships to
statistical pattern recognition. In
F. Fogelman Soulié and J. Hérault, editors,
Neurocomputing: Algorithms, Architectures